ABSTRACT

We consider the setting of a Semantic Web database, containing 
both explicit data encoded in RDF triples, and implicit data, im- 
plied by the RDF semantics. Based on a query workload, we ad- 
dress the problem of selecting a set of views to be materialized in 
the database, minimizing a combination of query processing, view 
storage, and view maintenance costs. Starting from an existing rela- 
tional view selection method, we devise new algorithms for recom- 
mending view sets, and show that they scale significantly beyond 
the existing relational ones when adapted to the RDF context. To 
account for implicit triples in query answers, we propose a novel 
RDF query reformulation algorithm and an innovative way of in- 
corporating it into view selection in order to avoid a combinatorial 
explosion in the complexity of the selection process. The interest 
of our techniques is demonstrated through a set of experiments. 

1. INTRODUCTION

A key ingredient for the Semantic Web vision [4] is a data format 
for describing items from the real and digital world in a machine- 
exploitable way. The W3C's resource description framework (RDF, 
in short [26]) is a leading candidate for this role. 

At a first look, querying RDF resembles querying relational data. 
Indeed, at the core of the W3C's SPARQL query language for 
RDF [27] lies conjunctive relational-style querying. There are, 
however, several important differences in the data model. First, an 
RDF data set is a single large set of triples, in contrast with the typ- 
ical relational database featuring many relations with varying num- 
bers of attributes. Second, RDF triples may feature blank nodes, 
standing for unknown constants or URIs; an RDF database may, 
for instance, state that the author of X is Jane while the date of X 
is 4/1/2011, for a given, unknown resource X. This contrasts with 
standard relational databases where all attribute values are either 
constants or null. Finally, in typical relational databases, all data 
is explicit, whereas the semantics of RDF entails a set of implicit 
triples which must be reflected in query answers. One important 
source of implicit triples follows from the use of an (optional) RDF 
Schema (or RDFS, in short [26]), to enhance the descriptive power
of an RDF data set. For instance, assume the RDF database con- 
tains the fact that the driverLicenseNo of John is 12345, whereas an 
RDF Schema states that only a person can have a driverLicenseNo. 
Then, the fact that John is a person is implicitly present in the 
database, and a query asking for all person instances in the database 
must return John. 

The complex, graph-structured RDF model is suitable for de- 
scribing heterogeneous, irregular data. However, it is clearly not a 
good model for storing the data. Existing RDF platforms therefore 
assume a simple (application-independent) storage model, comple- 
mented by indexes and efficient query evaluation techniques [1, 15, 
16, 17, 20, 23], or by RDF materialized views [6, 9]. While indexes 
or views speed up the evaluation of the fragments of queries match- 
ing them, the query processor may still need to access the main 
RDF database to evaluate the remaining fragments of the queries. 

We consider the problem of choosing a (relational) storage model
for an RDF application. Based on the application workload, we 
seek a set of views to materialize over the RDF database, such that 
all workload queries can be answered based solely on the recommended
views, with no need to access the database. Our goal is
to enable three-tier deployment of RDF applications, where clients 
do not connect directly to the database, but to an application server, 
which could store only the relevant views; alternatively, if the views 
are stored at the client, no connection is needed and the application 
can run off-line, independently from the database server. 

RDF datasets can be very different: data may be more or less 
structured, schemas may be complex, simple, or absent, updates 
may be rare or frequent. Moreover, RDF applications may differ 
in the shape, size and similarity of queries, costs of propagating 
updates to the views etc. To capture this variety, we characterize 
candidate view sets by a cost function, which combines (i) query 
evaluation costs, (ii) view maintenance costs and (iii) view storage
space. Our contributions are the following: 

1. This is the first study of RDF materialized view selection sup- 
porting the rewriting of all workload queries. We show how to 
model this as a search problem in a space of states, inspired from a 
previous work in relational data warehousing [21]. 

2. Implicit triples entailed by the RDF semantics [26] must be reflected
in the recommended materialized views, since they may participate
in query results. Two methods are currently used to include
implicit triples in query results. Database saturation adds them to
the database, while query reformulation leaves the database intact 
and modifies queries in order to also capture implicit triples. Our 
approach requires no special adaptation if applied on a saturated 
database. For the reformulation scenario, we propose a novel RDF 
query reformulation algorithm. This algorithm extends the state of 
the art in query processing in the presence of RDF Schemas [3, 
5], and is a contribution applying beyond the context of this work.
Moreover, we propose an innovative method of using reformulation 
(called post-reformulation) which enables us to efficiently take into 
account implicit triples in our view selection approach. 

3. We consider heuristic search strategies, since the complexity 
of complete search is extremely high. Existing strategies for rela-
tional view selection [21] run out of memory and fail to produce
a solution when the number of atoms in the query workload grows. 
Since RDF atoms are short (just three attributes), RDF queries are 
syntactically more complex (they have more atoms) than relational 
queries retrieving the same information, making this scale problem 
particularly acute for RDF. We propose a set of new strategies and 
heuristics which greatly improve the scalability of the search. 

4. We study the efficiency and effectiveness of the above algo- 
rithms, and their improvement over existing similar approaches, 
through a set of experiments. 

This paper is organized as follows. Section 2 formalizes the 
problem we consider. Section 3 presents the view selection prob- 
lem as a search problem in a space of candidate states, whereas 
Section 4 discusses the inclusion of implicit RDF triples in our ap- 
proach. Section 5 describes the search strategies and heuristics used 
to navigate in the search space. Section 6 presents our experimental 
evaluation. Section 7 discusses related works, then we conclude. 

2. PROBLEM STATEMENT 

In accordance with the RDF specification [26], we view an RDF 
database as a set of (s,p,o) triples, where s is the subject, p the 
property, and o the object. RDF triples are well-formed, that is: 
subjects can be URIs or blank nodes, properties are URIs, while 
objects can be URIs, blank nodes, or literals (i.e., values). Blank 
nodes are placeholders for unknown constants (URIs or literals); 
from a database perspective, they can be seen as existential vari- 
ables in the data. While relational tuples including the null token, 
commonly used to represent missing information, do not join (null
does not satisfy any predicate), RDF triples referring to the same
blank node may be joined to construct complex results, as exempli- 
fied in the Introduction. Due to blank nodes, an RDF database can 
be seen as an incomplete relational database consisting of a single 
triple table t(s, p, o), under the open-world assumption [2]. 

To express RDF queries (and views), we consider the basic graph 
pattern queries of SPARQL [27], represented wlog as a special case 
of conjunctive queries: conjunctions of atoms, the terms of which 
are either free variables (a.k.a. head variables), existential variables, 
or constants. We do not use a specific representation for blank 
nodes in queries, although SPARQL does, because they behave ex- 
actly like existential variables. 

Definition 2.1 (RDF queries/views). An RDF query 
(or view) is a conjunctive query over the triple table t(s,p, o). 

We consider wlog queries without Cartesian products, i.e., each 
triple shares at least one variable (joins at least) with another triple. 
We represent a query with a Cartesian product by the set of its in- 
dependent sub-queries. Finally, we assume queries and views are 
minimal, i.e., the only containment mapping from a query (or view) 
to itself is the identity [7]. 

As a running example, we use the following query $q_1$, which asks
for painters that have painted "Starry Night" and have a child who
is also a painter, as well as the paintings of their children:

$q_1(X, Z)$ :- $t(X, \text{hasPainted}, \text{starryNight})$, $t(X, \text{isParentOf}, Y)$,
$t(Y, \text{hasPainted}, Z)$
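
To make the running example concrete, the following minimal sketch evaluates $q_1$ over a tiny triple table. The data values, the function name `evaluate`, and the convention that variables are strings starting with an uppercase letter are all assumptions made for illustration; they are not part of the paper.

```python
# Hypothetical toy triple table t(s, p, o); the data values are made up.
TRIPLES = {
    ("vanGogh", "hasPainted", "starryNight"),
    ("vanGogh", "isParentOf", "junior"),
    ("junior", "hasPainted", "sunflowers2"),
    ("monet", "hasPainted", "waterLilies"),
}

def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[0].isupper()

def evaluate(head, body, triples):
    """Naive conjunctive-query evaluation: extend variable bindings atom by atom."""
    bindings = [{}]
    for atom in body:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = dict(binding)
                matches = True
                for term, value in zip(atom, triple):
                    if is_var(term):
                        # Bind the variable, or check it against its earlier binding.
                        if candidate.setdefault(term, value) != value:
                            matches = False
                            break
                    elif term != value:
                        matches = False
                        break
                if matches:
                    extended.append(candidate)
        bindings = extended
    return {tuple(b[v] for v in head) for b in bindings}

# q1(X, Z) :- t(X, hasPainted, starryNight), t(X, isParentOf, Y), t(Y, hasPainted, Z)
Q1_HEAD = ("X", "Z")
Q1_BODY = [
    ("X", "hasPainted", "starryNight"),
    ("X", "isParentOf", "Y"),
    ("Y", "hasPainted", "Z"),
]
answers = evaluate(Q1_HEAD, Q1_BODY, TRIPLES)
```

The nested-loop strategy is deliberately simple; a real RDF engine would rely on indexes and join ordering instead.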

Based on views, one can rewrite the workload queries: 

Definition 2.2 (Rewriting). Let $q$ be an RDF query and
$V = \{v_1, v_2, \ldots, v_k\}$ be a set of RDF views. A rewriting of $q$
based on $V$ is a conjunctive query (i) equivalent to $q$ (i.e., on any
data set, it yields the same answers as $q$), (ii) involving only
relations from $V$ and (iii) minimal, in the sense mentioned above.

We are now ready to define our view selection problem, which 
relies on candidate view sets: 

Definition 2.3 (Candidate view set). Let $Q$ be a set of
RDF queries. A candidate view set for $Q$ is a pair $(V, R)$ such that:

• $V$ is a set of RDF views,

• $R$ is a set of rewritings such that: (i) for every query $q \in Q$
there exists exactly one rewriting $r \in R$ of $q$ using the views
in $V$; (ii) all views in $V$ are useful, i.e., every view $v \in V$
participates in at least one rewriting $r \in R$.

We consider a cost estimation function $c^\epsilon$ which returns a quan-
titative measure of the costs associated with a view set. The lower
the cost, the better the candidate view set is. Our cost components
include the effort to evaluate the view-based query rewritings, the
total space occupancy of the views and the view maintenance costs
as data changes. More details about $c^\epsilon$ are provided in Section 3.3.

Definition 2.4 (View selection problem). Let $Q = \{q_1, q_2, \ldots, q_n\}$
be a set of RDF queries and $c^\epsilon$ be a cost estimation
function. The view selection problem consists in finding a
candidate view set $(V, R)$ for $Q$ such that, for any other candidate
view set $(V', R')$ for $Q$: $c^\epsilon((V, R)) \le c^\epsilon((V', R'))$.

3. THE SPACE OF CANDIDATE VIEW SETS 

This Section describes our approach for modeling the space of 
possible candidate view sets. Section 3.1 introduces the notion of a 
state to model one such set, while Section 3.2 presents a set of tran- 
sitions that can be used to transform one state to another. Finally, 
Section 3.3 shows how to assign a cost estimation to each state. 

3.1 States 

We use the notion of state to model a candidate view set together 
with the rewritings of the workload queries based on these views. 
The set of all possible candidate view sets, then, is modeled as a 
set of states, which we adapt from the previous work on material- 
ized view selection in a relational data warehouse [21]. From here 
forward, given a workload Q, we may use S(Q) (possibly with 
subscripts or superscripts) to denote a candidate view set for Q. To 
ease the exposition, we also employ from [21] a visual representa- 
tion of each state by means of a state graph. 

Definition 3.1 (State graph). Given a query set $Q$ and
a state $S_i(Q) = (V_i, R_i)$, the state graph $G(S_i) = (N_i, E_i)$ is a
directed multigraph such that:

• each triple $t_i$ appearing in a view $v \in V_i$ is represented by a
node $n_i \in N_i$;

• let $t_i$ and $t_j$ be two triples in a view $v \in V_i$, and a join on
their attributes $t_i.a_i$ and $t_j.a_j$ (where $a_i, a_j \in \{s, p, o\}$).
For each such join, there is an edge $e_j \in E_i$ connecting the
respective nodes $n_i, n_j \in N_i$ and labeled $v{:}n_i.a_i = n_j.a_j$.
We call $e_j$ a join edge;

• let $t_i$ be a triple in a view $v \in V_i$ and $n_i \in N_i$ be its cor-
responding node. For every constant $c_i$ that appears in the
attribute $a_i \in \{s, p, o\}$ of $t_i$, an edge labeled $v{:}n_i.a_i = c_i$
connects $n_i$ to itself. Such an edge is called a selection edge.

The graph of v is defined as the subgraph of G(Si) correspond- 
ing to v. Observe that in a view, two nodes may be connected by 
several join edges if their corresponding atoms are connected by
more than one join predicate.
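
The definition above can be sketched in a few lines of code. The sketch below builds the nodes, join edges and selection edges of a state graph from a set of views; the function name `state_graph`, the tuple-based edge encoding, and the uppercase-initial convention for variables are illustrative assumptions, not the representation used by the authors.

```python
def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[0].isupper()

def state_graph(views):
    """views maps a view name to its list of atoms (s, p, o).

    Returns (nodes, join_edges, selection_edges) following Definition 3.1.
    """
    nodes, join_edges, selection_edges = [], [], []
    counter = 0
    for name, atoms in views.items():
        local = []
        for atom in atoms:
            counter += 1
            node = f"n{counter}"
            nodes.append((node, name, atom))
            local.append((node, atom))
            # One selection edge (a self-loop) per constant in the atom.
            for attr, term in zip("spo", atom):
                if not is_var(term):
                    selection_edges.append((name, f"{node}.{attr}", term))
        # One join edge per pair of attribute positions sharing a variable.
        for i in range(len(local)):
            for j in range(i + 1, len(local)):
                (ni, ai), (nj, aj) = local[i], local[j]
                for attr_i, ti in zip("spo", ai):
                    for attr_j, tj in zip("spo", aj):
                        if is_var(ti) and ti == tj:
                            join_edges.append(
                                (name, f"{ni}.{attr_i}", f"{nj}.{attr_j}"))
    return nodes, join_edges, selection_edges

# The initial state for the running example has a single view equal to the query.
nodes, joins, selections = state_graph({
    "v1": [("X", "hasPainted", "starryNight"),
           ("X", "isParentOf", "Y"),
           ("Y", "hasPainted", "Z")],
})
```

Because edges are collected in a list, two atoms sharing several variables yield several join edges, which matches the multigraph reading of the definition.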
Figure 1: Sample initial state graph $S_0$, and states attained through successive transitions.



We define two states to be equivalent if they have the same view 
sets. Furthermore, to avoid a blow-up in the storage space required 
by the views, we do not consider views including Cartesian prod- 
ucts. In a relational setting, some Cartesian products, e.g., between 
small dimension tables in an OLAP context, may not raise perfor- 
mance issues. In contrast, in the RDF context where all data lies in 
a single large triple table, views with Cartesian products are likely 
not interesting and their storage overhead is prohibitive. The ab- 
sence of Cartesian products from our views entails that the graph 
of every view is a connected component of the state graph. 

As a (simple) example, consider the state $S_0(Q) = (\{v_1\}, R_0)$,
where $Q = \{q_1\}$ is a workload containing only the previously in-
troduced query $q_1$, and $v_1 = q_1$. The rewriting set $R_0$ consists of
the trivial rewriting $\{q_1 = v_1\}$. The graph $G(S_0)$ is depicted at left
in Figure 1, and since it corresponds to a single view, it comprises
only one connected component.

3.2 State Transitions 

To enumerate candidate view sets (or, equivalently, states), we 
use four transitions, inspired from [21]. As we show in Section 5.1, 
our transition set is complete, i.e., all possible states for a given 
workload can be reached through our four transitions. The first 
three transitions remove predicates from views, thus can be seen as 
"relaxing", and may split a view in two, increasing the number of 
views. The last one factorizes two views into one, thus reducing the 
number of workload views. The graphs corresponding to the states 
before and after each transition are illustrated in Figure 1. 

We use v.e to denote an edge e belonging to the view v in a state 
graph. While we define rewritings as conjunctive queries, for ease 
of explanation, we now denote rewritings by (equivalent) relational 
algebra expressions. We use $\sigma_e$ to denote a selection on the condi-
tion attached to the edge $e$ in a view set graph. Since the query set
Q is unchanged across all transitions, we omit it for readability. 

Definition 3.2 (View Break (Vb)). Let $S = (V, R)$ be a
state, $v$ a view in $V$ and $N_v$ the set of nodes of the graph of $v$ with
$|N_v| > 2$. Let $N_{v_1}$, $N_{v_2}$ be two subsets of $N_v$ such that:

• $N_{v_1} \not\subseteq N_{v_2}$ and $N_{v_2} \not\subseteq N_{v_1}$;

• $N_{v_1} \cup N_{v_2} = N_v$;

• the subgraph of the graph of $v$ defined by $N_{v_1}$ (respectively,
by $N_{v_2}$) and the edges between these nodes is connected.

We create two new views, $v_1$ and $v_2$. View $v_1$ (respectively $v_2$)
derives from the graph of $v$ by copying the nodes corresponding to
$N_{v_1}$ ($N_{v_2}$) and the edges between them. The head variables of $v_1$
($v_2$) are those of $v$ appearing also in the body of $v_1$ ($v_2$), together
with all additional variables appearing in the nodes $N_{v_1} \cap N_{v_2}$.

The new state $S' = (V', R')$ consists of:

• $V' = (V \setminus \{v\}) \cup \{v_1, v_2\}$,

• $G(S')$ is obtained from $G(S)$ by removing the graph of $v$
and adding those of $v_1$ and $v_2$, and

• $R'$ is obtained from $R$ by replacing all the occurrences of $v$
with $\pi_{head(v)}(v_1 \bowtie v_2)$, where $\bowtie$ is the natural join.

For example, we apply a view break on the view $v_1$ of state $S_0$
introduced in the previous Section, and obtain the new state $S_1$:

$S_1 = (\{v_2, v_3\}, \{q_1 = \pi_{head(v_1)}(v_2 \bowtie v_3)\})$



Definition 3.3 (Selection Cut (Sc)). Let $S = (V, R)$
be a state and $v.e$ be a selection edge in $G(S)$. A selection cut
on $e$ yields a state $S' = (V', R')$ such that:

• $V'$ is obtained from $V$ by replacing $v$ with a new view $v'$, in
which the constant in the selection edge $e$ has been replaced
with a fresh head variable (i.e., one that is returned by $v'$, along
with the variables returned by $v$),

• $G(S')$ is obtained from $G(S)$ by removing the graph of $v$
and adding the one of $v'$, and

• $R'$ is obtained from $R$ by replacing all occurrences of $v$ with
the expression $\pi_{head(v)}(\sigma_e(v'))$.

For instance, we apply a selection cut on the edge labeled
$v_2{:}n_1.o = \text{starryNight}$ of $G(S_1)$ and obtain the state $S_2$, in which
$v_2$ is replaced by a new view $v_4$:

$S_2 = (\{v_3, v_4\}, \{q_1 = \pi_{head(v_1)}(\pi_{head(v_2)}(\sigma_{n_1.o = \text{starryNight}}(v_4)) \bowtie v_3)\})$
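
As an illustration of the selection cut, the sketch below relaxes one constant of a view into a fresh head variable and returns the selection condition that a rewriting must re-apply. The function name `selection_cut`, the choice of `W` as the fresh variable, and the assumed head $(X, Y)$ of $v_2$ are hypothetical choices made for this example only.

```python
def selection_cut(head, body, atom_index, attr, fresh_var):
    """Replace the constant at body[atom_index].attr with fresh_var.

    Returns the relaxed view (new_head, new_body) and the compensating
    selection condition (fresh_var, constant) to apply in rewritings.
    The caller must supply a variable name not already used in the view.
    """
    pos = "spo".index(attr)
    atom = list(body[atom_index])
    constant = atom[pos]
    atom[pos] = fresh_var
    new_body = body[:atom_index] + [tuple(atom)] + body[atom_index + 1:]
    new_head = head + (fresh_var,)  # the fresh variable is returned by the view
    return new_head, new_body, (fresh_var, constant)

# Assumed shape of v2 after the view break: v2(X, Y) over two atoms.
v2_head = ("X", "Y")
v2_body = [("X", "hasPainted", "starryNight"), ("X", "isParentOf", "Y")]
v4_head, v4_body, condition = selection_cut(v2_head, v2_body, 0, "o", "W")
```

Projecting the selection `W = starryNight` over the relaxed view back onto the original head recovers the answers of the original view, which is the role of $\pi_{head(v)}(\sigma_e(v'))$ in the definition.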

Definition 3.4 (Join Cut (Jc)). Let $S = (V, R)$ be a state
and $v.e$ be a join edge in $G(S)$ of the form $n_i.c_i = n_j.c_j$, such that
$c_i, c_j \in \{s, p, o\}$. A join cut on $e$ yields a state $S' = (V', R')$, ob-
tained as follows:

1. If the graph of $v$ is still connected after the cut, $V'$ is ob-
tained from $V$ by replacing $v$ with a new view $v'$ in which the
variable corresponding to the join edge $e$ becomes a head
variable, and the occurrence of that variable correspond-
ing to $n_i.c_i$ is replaced by a new fresh head variable. The
new rewriting set $R'$ is obtained from $R$ by replacing $v$ by
$\pi_{head(v)}(\sigma_e(v'))$. The new graph $G(S')$ is obtained from
$G(S)$ by removing the graph of $v$ and adding the one of $v'$.

2. If the graph of $v$ is split in two components, $V'$ is obtained
from $V$ by replacing $v$ with two new symbols $v'_1$ and $v'_2$, each
corresponding to one component. In each of $v'_1$ and $v'_2$, the
join variable of $e$ becomes a head variable. The new rewrit-
ing set $R'$ is obtained from $R$ by replacing $v$ by
$\pi_{head(v)}(v'_1 \bowtie_e v'_2)$. The new graph $G(S')$ is obtained from $G(S)$ by
removing the graph of $v$ and adding the ones of $v'_1$ and $v'_2$.

For example, cutting the join edge $v_4{:}n_1.s = n_2.s$ of $G(S_2)$
disconnects the graph of $v_4$, resulting in two new views, $v_5$ and $v_6$
(see Figure 1). View symbol $v_4$ is replaced in the rewritings by
the expression $\pi_{head(v_4)}(v_5 \bowtie_{n_1.s = n_2.s} v_6)$. If we continue by
cutting the edge $v_3{:}n_4.o = n_3.s$, $v_3$ is split into $v_7$ and $v_8$. The
resulting state $S_3$ is:

$S_3 = (\{v_5, v_6, v_7, v_8\},$
$\{q_1 = \pi_{head(v_1)}(\pi_{head(v_2)}(\sigma_{n_1.o = \text{starryNight}}(\pi_{head(v_4)}(v_5 \bowtie_{n_1.s = n_2.s} v_6))) \bowtie \pi_{head(v_3)}(v_7 \bowtie_{n_4.o = n_3.s} v_8))\})$

Definition 3.5 (View Fusion (Vf)). Let $S = (V, R)$ be
a state and $v_1, v_2$ be two views in $V$ such that their respective
graphs are isomorphic (their bodies are equivalent up to variable
renaming). We denote by $(i \to j)$ the renaming of the variables of
$v_i$ into those of $v_j$. Let $v_3$ be a copy of $v_1$, such that $head(v_3) =
head(v_1) \cup head(v_{2(2 \to 1)})$. Fusing $v_1$ and $v_2$ leads to a new state
$S' = (V', R')$ obtained as follows:

• $V' = (V \setminus \{v_1, v_2\}) \cup \{v_3\}$,

• $G(S')$ is obtained from $G(S)$ by removing the graphs of $v_1$
and $v_2$ and adding that of $v_3$, and

• $R'$ is obtained from $R$ by replacing any occurrence of $v_1$ with
$\pi_{head(v_1)}(v_3)$, and of $v_2$ with $\pi_{head(v_2)}(v_{3(3 \to 2)})$.

For example, in state $S_3$, the graphs of $v_5$ and $v_8$ are isomorphic,
and can thus be fused creating the new view $v_9$. Similarly, $v_6$ and
$v_7$ can be fused into a new view $v_{10}$, leading to state $S_4$.
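
The precondition of view fusion, namely that two view bodies are equal up to variable renaming, can be checked by brute force for small views. The sketch below is one possible way to do so; the function names and the uppercase-initial convention for variables are assumptions, and the permutation-based search is exponential in the number of atoms, so it is only meant to illustrate the condition.

```python
from itertools import permutations

def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[0].isupper()

def try_match(body1, body2):
    """Try to map the variables of body2 onto those of body1, atom by atom."""
    mapping, used = {}, set()
    for atom1, atom2 in zip(body1, body2):
        for t1, t2 in zip(atom1, atom2):
            if is_var(t1) != is_var(t2):
                return None
            if not is_var(t1):
                if t1 != t2:           # constants must coincide
                    return None
            elif t2 in mapping:
                if mapping[t2] != t1:  # renaming must be consistent
                    return None
            elif t1 in used:           # renaming must be one-to-one
                return None
            else:
                mapping[t2] = t1
                used.add(t1)
    return mapping

def renaming(body1, body2):
    """Return a renaming (2 -> 1) if the bodies are isomorphic, else None."""
    if len(body1) != len(body2):
        return None
    for perm in permutations(body2):
        mapping = try_match(body1, perm)
        if mapping is not None:
            return mapping
    return None

# Single-atom views in the spirit of the running example (names are illustrative).
v5 = [("X", "hasPainted", "W")]
v6 = [("X", "isParentOf", "Y")]
v8 = [("Y", "hasPainted", "Z")]
```

When a renaming exists, the fused view keeps the body of the first view and takes as head the union of the first head with the renamed second head, as in Definition 3.5.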

Our transitions adapt those introduced in [21] to our RDF view 
selection context; the differences are detailed in [25]. 

3.3 Estimated State Cost 

To each state, we associate a cost estimation $c^\epsilon$, taking into ac-
count: the space occupancy of all the materialized views, the cost
of evaluating the workload query rewritings, and the cost associated 
to the maintenance of the materialized views. 

For any conjunctive query or view $v$, we use $len(v)$ to denote
the number of atoms in $v$, $|v|$ for the number of tuples in $v$ and $|v|^\epsilon$
for our estimation of this number. Let $S(Q) = (V, R)$ be a state.

View space occupancy ($VSO^\epsilon$) To estimate the cardinality of a
given view $v \in V$, we adopt the solution of [16], which consists in
counting and storing the exact number of tuples (i) for each given
s, p and o value; (ii) for each pair of (s, p), (s, o) and (p, o) values.
This leads to exact cardinality estimations for any 1-atom view with
1 or 2 constants. The size of a 1-atom view with no constants is
the size of the data set; three-constant atoms are disallowed in our
framework since they introduce Cartesian products in views.
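
The counting scheme described above can be sketched as follows. The key layout of the statistics table and the function names are hypothetical, and the lookup assumes that the atom does not repeat a variable (an atom such as t(X, p, X) would need an extra filter).

```python
from collections import Counter

def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[0].isupper()

def gather_statistics(triples):
    """Count triples per single value and per pair of values."""
    stats = Counter()
    for s, p, o in triples:
        for key in (("s", s), ("p", p), ("o", o),
                    ("sp", s, p), ("so", s, o), ("po", p, o)):
            stats[key] += 1
    return stats

def one_atom_cardinality(atom, stats, total):
    """Exact size of a 1-atom view with zero, one or two constants."""
    constants = [(attr, term) for attr, term in zip("spo", atom)
                 if not is_var(term)]
    if not constants:
        return total  # no constant: the view is the whole triple table
    if len(constants) == 3:
        raise ValueError("three-constant atoms are disallowed")
    key = ("".join(attr for attr, _ in constants),) + tuple(
        term for _, term in constants)
    return stats[key]

# Made-up data used only to exercise the functions.
data = [("a", "p", "x"), ("a", "p", "y"), ("b", "p", "x"), ("a", "q", "x")]
stats = gather_statistics(data)
```

In the workload-driven variant described later in this Section, only the keys needed by the query atoms and their relaxations would be counted.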

We now turn to the case of multi-atom views. For each view
$v \in V$, and each atom $t_i \in v$, $1 \le i \le len(v)$, let $v^i$ be the
conjunctive query whose body consists of exactly the atom $t_i$ and
whose head projects the variables in $t_i$. From our gathered statis-
tics, we know $|v^i|$. We assume that values in each triple table col-
umn are uniformly distributed, and that values of different columns
are independently distributed¹. For the s, p and o columns, more-
over, we store the number of distinct values, as well as the mini-
mum and maximum values. Then, we compute $|v|^\epsilon$ based on the
exact counts $|v^i|$ and the above assumptions and statistics, applying
known relational formulas [18]. Finally, we use the average size of
a subject, property, respectively object, the attributes in the head of
$v$, and $|v|^\epsilon$, to estimate the space occupancy of view $v$.

Since the workload is known, we gather only the statistics needed 
for this workload: (i) we count the triples matching each of the 
query atoms (ii) we also count the triples matching all relaxations 
of these atoms, obtained by removing constants (as Sc does during 
the search). Consider, for instance, the following query: 

$q(X_1, X_2)$ :- $t(X_1, \text{rdf:type}, \text{picture})$, $t(X_1, \text{isLocatIn}, X_2)$

We count the triples matching the two query atoms:

$q^1(X_1)$ :- $t(X_1, \text{rdf:type}, \text{picture})$, $q^2(X_1, X_2)$ :- $t(X_1, \text{isLocatIn}, X_2)$

as well as the triples matching three relaxed atoms, obtained by
removing the constants from $q^1$ and $q^2$:

$q^3(X_1, X_2)$ :- $t(X_1, \text{rdf:type}, X_2)$, $q^4(X_1, X_2)$ :- $t(X_1, X_2, \text{picture})$,
$q^5(X_1, X_2, X_3)$ :- $t(X_1, X_2, X_3)$.

Based on the cardinalities of the above atoms, we can estimate the 
cardinality of any possible view created throughout the search. 
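
The relaxed atoms of the example can be enumerated mechanically. The sketch below replaces every non-empty subset of the constants of an atom with fresh variables and removes duplicates up to variable renaming; the function names, the `Fresh` and `V` variable prefixes, and the uppercase-initial convention for variables are assumptions made for illustration.

```python
from itertools import combinations

def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[0].isupper()

def canonical(atom):
    """Rename variables by order of appearance so equal shapes compare equal."""
    names, out = {}, []
    for term in atom:
        if is_var(term):
            names.setdefault(term, f"V{len(names) + 1}")
            out.append(names[term])
        else:
            out.append(term)
    return tuple(out)

def relaxations(atom):
    """All atoms obtained by turning a non-empty subset of constants into variables."""
    const_positions = [i for i, term in enumerate(atom) if not is_var(term)]
    result = set()
    for size in range(1, len(const_positions) + 1):
        for subset in combinations(const_positions, size):
            relaxed = list(atom)
            for n, position in enumerate(subset):
                relaxed[position] = f"Fresh{n}"  # assumed not to clash with query variables
            result.add(canonical(tuple(relaxed)))
    return result

# The two atoms of the example query q.
atom1 = ("X1", "rdf:type", "picture")
atom2 = ("X1", "isLocatIn", "X2")
all_relaxed = relaxations(atom1) | relaxations(atom2)
```

On this workload the union contains three relaxed atoms, matching $q^3$, $q^4$ and $q^5$ in the text: the fully relaxed atom is produced from both query atoms and counted once.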
Rewriting evaluation cost ($REC^\epsilon$) This cost estimation reflects
the processing effort needed to answer the workload queries using
the proposed rewritings in $R$. It is computed as:

$REC^\epsilon(S) = \sum_{r \in R} (c_1 \cdot io^\epsilon(r) + c_2 \cdot cpu^\epsilon(r))$

¹ A very recent work [14] provides an RDF query size estimation
method which does not make the independence assumption. This 
estimation method could easily be integrated in our framework. 



where $io^\epsilon(r)$ and $cpu^\epsilon(r)$ estimate the I/O cost and the CPU pro-
cessing cost of executing the rewriting $r$ respectively, and $c_1$, $c_2$ are
some weights. The I/O cost estimation is:

where $v \in r$ denotes a view appearing in the rewriting $r$.

The CPU cost estimation $cpu^\epsilon(r)$ sums up the estimated costs
of the selections, projections, and joins required by the rewriting 
r, computed based on the view cardinality estimations and known 
formulas from the relational query processing literature [18]. 
View maintenance cost ($VMC^\epsilon$) The cost of maintaining the views
in V when the data is updated depends on the algorithm imple- 
mented to propagate the updates. In a conservative way, we chose 
to account only for the costs of writing/removing tuples to/from the 
views due to an update, ignoring the other maintenance operation 
costs. Consider the addition of a triple $t^+$ to the triple table, and
a view $v$ of $len(v)$ atoms. With some simplification, we consider
that $t^+$ joins with $f_1$ existing triples for some constant $f_1$, the tu-
ples resulting from this, in turn, join with $f_2$ existing triples etc.
Adding the triple $t^+$ thus causes the addition of $f_1 \cdot f_2 \cdot \ldots \cdot f_{len(v)}$
tuples to $v$. A similar reasoning holds for deletions. To avoid esti-
mating $f_1, f_2, \ldots, f_{len(v)}$, which may be costly or impossible for
triples which will be added in the future, we consider a single user-
provided factor $f$, and compute:

$VMC^\epsilon(S) = \sum_{v \in V} f^{len(v)}$
The estimated cost $c^\epsilon$ of a state $S$ is defined as:

$c^\epsilon(S) = c_s \cdot VSO^\epsilon(S) + c_r \cdot REC^\epsilon(S) + c_m \cdot VMC^\epsilon(S)$

where the numerical weights $c_s$, $c_r$ and $c_m$ determine the impor-
tance of each component: if storage space is cheap $c_s$ can be set
very low, if the triple table is rarely updated $c_m$ can be reduced etc.
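
The arithmetic of the maintenance cost and of the weighted sum can be made concrete as follows. The function names, the sample view lengths and all numeric values are made up for illustration; the space occupancy and rewriting evaluation components are passed in as plain numbers because their estimation relies on statistics not modeled here.

```python
def view_maintenance_cost(view_lengths, f):
    """Sum of f ** len(v) over the views of a state, with a single factor f."""
    return sum(f ** length for length in view_lengths)

def state_cost(vso, rec, vmc, c_s=1.0, c_r=1.0, c_m=1.0):
    """Weighted combination of the three cost components of a state."""
    return c_s * vso + c_r * rec + c_m * vmc

# A state with two 1-atom views and one 2-atom view, and f = 2:
# 2**1 + 2**1 + 2**2 = 8.
vmc_value = view_maintenance_cost([1, 1, 2], f=2)

# Cheap storage (low c_s) and costly query evaluation (high c_r):
# 0.5 * 100 + 2.0 * 10 + 1.0 * 8 = 78.0.
total = state_cost(vso=100, rec=10, vmc=vmc_value, c_s=0.5, c_r=2.0, c_m=1.0)
```

Since the maintenance term grows exponentially with the number of atoms of a view, states made of many short views are favored when $c_m$ is large.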
Impact of transitions on the cost Transition Sc increases the view 
size and adds to some rewritings the CPU cost of the selection. 
Thus, Sc always increases the state cost. Transitions Jc and Vb 
may increase or decrease the space occupancy, and add the costs of 
a join to some rewritings. Jc decreases maintenance costs, whereas 
Vb may increase or decrease it. Overall, Jc and Vb may increase 
or decrease the state cost. Finally, Vf decreases the view space 
occupancy and view maintenance costs. Query processing costs 
may remain the same or be reduced, but they cannot increase. Thus, 
Vf always reduces the overall cost of a state. 

4. VIEW SELECTION & RDF REASONING 

The approach described so far does not take into consideration 
the implicit triples that are intrinsic to RDF and that complete query 
answers. Section 4.1 introduces the notion of RDF entailment to
which such triples are due. Section 4.2 presents the two main meth- 
ods for processing RDF queries when RDF entailment is consid- 
ered, namely database saturation and query reformulation. In par- 
ticular, we devise a novel reformulation algorithm extending the 
state of the art. Finally, Section 4.3 details how we take RDF en- 
tailment into account in our view selection approach. 

4.1 RDF entailment 

The W3C RDF recommendation [26] provides a set of entail- 
ment rules, which lead to deriving new implicit (or entailed) triples 
from an RDF database. We provide here an overview of these rules. 

Some implicit triples are obtained by generalizing existing triples 
using blank nodes. For instance, a triple (s,p,o) entails the triple 
(_:b, p, o), where s is a URI and _:b denotes a blank node.

Some other rules derive implicit triples from the semantics of a
few special URIs, which are part of the RDF standard, and are as-
signed special meaning. For instance, RDF provides the rdfs:Class
URI whose semantics is the set of all RDF-specific (predefined)
and user-defined URIs denoting classes to which resources may be-
long. When, for example, a triple states that a resource u belongs to
a given user-defined class painting, i.e., (u, rdf:type, painting)
using the predefined URI rdf:type, an implicit triple states that
painting is a class: (painting, rdf:type, rdfs:Class).

Semantic relationship       | RDF notation                 | FOL notation
Class inclusion             | (c1, rdfs:subClassOf, c2)    | $\forall X (c_1(X) \Rightarrow c_2(X))$
Property inclusion          | (p1, rdfs:subPropertyOf, p2) | $\forall X \forall Y (p_1(X, Y) \Rightarrow p_2(X, Y))$
Domain typing of a property | (p, rdfs:domain, c)          | $\forall X \forall Y (p(X, Y) \Rightarrow c(X))$
Range typing of a property  | (p, rdfs:range, c)           | $\forall X \forall Y (p(X, Y) \Rightarrow c(Y))$

Table 1: Semantic relationships expressible in an RDFS.

Finally, some rules derive implicit triples from the semantics en- 
capsulated in an RDF Schema (RDFS for short). An RDFS speci- 
fies semantic relationships between classes and properties used in 
descriptions. Table 1 shows the four semantic relationships allowed 
in RDF, together with their first-order logic semantics. Some rules 
derive implicit triples through the transitivity of class and property 
inclusions, and of inheritance of domain and range typing. For in-
stance, if painting is a subclass of masterpiece, i.e., (painting,
rdfs:subClassOf, masterpiece), which is a subclass of work,
i.e., (masterpiece, rdfs:subClassOf, work), then an entailed
triple is (painting, rdfs:subClassOf, work). If hasPainted is a
subproperty of hasCreated, i.e., (hasPainted,
rdfs:subPropertyOf, hasCreated), the ranges of which are the
classes painting and masterpiece respectively, i.e.,
(hasPainted, rdfs:range, painting) and (hasCreated,
rdfs:range, masterpiece), then the following triples are implicit:
(hasPainted, rdfs:range, masterpiece), (hasPainted,
rdfs:range, work), and (hasCreated, rdfs:range, work). Some
other rules use the RDFS to derive implicit triples by propagat-
ing values (URIs, blank nodes, and literals) from subclasses and
subproperties to their superclasses and superproperties, and from
properties to classes typing their domains and ranges. If a re-
source u has painted something, i.e., (u, hasPainted, _:b), the im-
plicit triples are: (u, hasCreated, _:b), (_:b, rdf:type, painting),
(_:b, rdf:type, masterpiece), and (_:b, rdf:type, work).
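
The instance-level part of the example above can be reproduced with a small fixpoint computation. The sketch below applies only the four RDFS relationships of Table 1 to data triples; it does not derive schema-level triples (such as the transitive closure of rdfs:subClassOf) and ignores the other RDF entailment rules. The function name `saturate` mirrors the text, but this implementation is an illustrative assumption.

```python
TYPE = "rdf:type"

def saturate(data, schema):
    """Add instance-level implicit triples until a fixpoint is reached."""
    facts = set(data)
    while True:
        derived = set()
        for s, p, o in facts:
            for a, relation, b in schema:
                if relation == "rdfs:subClassOf" and p == TYPE and o == a:
                    derived.add((s, TYPE, b))   # instance of a is an instance of b
                elif relation == "rdfs:subPropertyOf" and p == a:
                    derived.add((s, b, o))      # values of a are values of b
                elif relation == "rdfs:domain" and p == a:
                    derived.add((s, TYPE, b))   # subject typed by the domain
                elif relation == "rdfs:range" and p == a:
                    derived.add((o, TYPE, b))   # object typed by the range
        if derived <= facts:
            return facts
        facts |= derived

schema = {
    ("painting", "rdfs:subClassOf", "masterpiece"),
    ("masterpiece", "rdfs:subClassOf", "work"),
    ("hasPainted", "rdfs:subPropertyOf", "hasCreated"),
    ("hasPainted", "rdfs:range", "painting"),
    ("hasCreated", "rdfs:range", "masterpiece"),
}
data = {("u", "hasPainted", "_:b")}
implicit = saturate(data, schema) - data
```

The loop terminates because every derived triple is built from a finite set of subjects, objects and schema constants.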

Returning complete answers requires considering all the implicit 
triples. In practice, RDF data management frameworks (e.g., Jena²)
allow specifying the subset of RDF entailment rules w.r.t. which 
completeness is required. This is because the implicit triples brought 
by some rules, e.g., generalization of constants into blank nodes, 
may not be very informative in most settings. Of particular inter- 
est among all entailment rules are usually those derived from an 
RDFS, since they encode application domain semantics. 

4.2 RDF entailment and query answering 

We consider here the two main approaches previously proposed 
to answer queries w.r.t. a given set of RDF entailment rules: data- 
base saturation and query reformulation. 

Database saturation The first approach saturates the database by
adding to it all the implicit triples specified in the RDF recommen-
dation [26]. The benefit of saturation is that standard query eval-
uation techniques for plain RDF can be applied on the resulting
database to compute complete answers [27]. Saturation also has
drawbacks. First, it needs more space to store the implicit triples,
competing with the data and the materialized views. Observe that
saturation adds all implicit triples to the store, whether user queries
need them or not. Second, the maintenance of a saturated database,
which can be seen as an inflationary fixpoint, when adding or re-
moving data and/or RDFS statements may be complex and costly.
Finally, saturation is not always possible, e.g., when querying is
performed at a client with no write access to the database.

² http://jena.sourceforge.net/

Algorithm 1: Reformulate(q, S)

Input: an RDF schema $S$ and a conjunctive query $q$ over $S$
Output: a union of conjunctive queries $ucq$ such that for any
database $D$: evaluate($q$, saturate($D$, $S$)) = evaluate($ucq$, $D$)

$ucq \leftarrow \{q\}$, $ucq' \leftarrow \emptyset$
while $ucq \neq ucq'$ do
    $ucq' \leftarrow ucq$
    foreach conjunctive query $q' \in ucq'$ do
        foreach atom $g$ in $q'$ do
            if $g = t(s, \text{rdf:type}, c_2)$ and $c_1$ rdfs:subClassOf $c_2 \in S$ then   // rule 1
                $ucq \leftarrow ucq \cup \{q'_{[g/t(s, \text{rdf:type}, c_1)]}\}$
            if $g = t(s, p_2, o)$ and $p_1$ rdfs:subPropertyOf $p_2 \in S$ then   // rule 2
                $ucq \leftarrow ucq \cup \{q'_{[g/t(s, p_1, o)]}\}$
            if $g = t(s, \text{rdf:type}, c)$ and $p$ rdfs:domain $c \in S$ then   // rule 3
                $ucq \leftarrow ucq \cup \{q'_{[g/\exists X\, t(s, p, X)]}\}$
            if $g = t(o, \text{rdf:type}, c)$ and $p$ rdfs:range $c \in S$ then   // rule 4
                $ucq \leftarrow ucq \cup \{q'_{[g/\exists X\, t(X, p, o)]}\}$
            if $g = t(s, \text{rdf:type}, X)$ and $c_1, c_2, \ldots, c_n$ are all the classes in $S$ then   // rule 5
                $ucq \leftarrow ucq \cup \bigcup_{i=1}^{n} \{(q'_{[g/t(s, \text{rdf:type}, c_i)]})_{\sigma=[X/c_i]}\}$
            if $g = t(s, X, o)$ and $p_1, p_2, \ldots, p_m$ are all the properties in $S$ then   // rule 6
                $ucq \leftarrow ucq \cup \bigcup_{i=1}^{m} \{(q'_{[g/t(s, p_i, o)]})_{\sigma=[X/p_i]}\} \cup \{(q'_{[g/t(s, \text{rdf:type}, o)]})_{\sigma=[X/\text{rdf:type}]}\}$
return $ucq$

Query reformulation The second approach reformulates a (con- 
junctive) query into an equivalent union of (conjunctive) queries. 
The complete answers of the initial query (w.r.t. the considered 
RDF entailment rules) can be obtained by standard query evalu- 
ation techniques for plain RDF [27] using this union of queries 
against the non-saturated database. 

The benefit of reformulation is leaving the database unchanged. 
However, reformulation has an overhead at query evaluation time. 

Query reformulation w.r.t. an RDFS Query reformulation algorithms
have been investigated in the literature for the well-known
Description Logic fragment of RDF [3, 5]: datasets with RDFSs,
without blank nodes, and where RDF entailment only considers
the rules associated with an RDFS (those of the third kind in Section
4.1). However, these algorithms only reformulate queries from
a strictly less expressive language than that of our RDF queries
(see Section 7 for more details) and, thus, cannot be applied to our
setting. We therefore propose Algorithm 1, which fully captures
our query language, so that we can obtain the complete answers of
any RDF query by evaluating its reformulation.

The algorithm uses the set of rules of Figure 2 to unfold the
queries; in this Figure and onwards, we denote by $s$, $p$, and $o$,
respectively, a placeholder for either a constant or a variable
occurring in the subject, property, and object position of a triple
atom. Notice that rules (1)-(4) follow from the four rules of
Table 1. The evaluate and saturate functions used in Algorithm 1
provide, respectively, the standard query evaluation for plain RDF,
and the saturation of a data set w.r.t. an RDFS (Table 1). Moreover,
$q_{[g/g']}$ is the result of replacing the atom $g$ of the query $q$ by the
atom $g'$, and $q_{\sigma=[X/c]}$ is the result of replacing any occurrence of
the variable $X$ in $q$ with the constant $c$.
$t(s, \text{rdf:type}, c_1) \Rightarrow t(s, \text{rdf:type}, c_2)$, with $c_1$ rdfs:subClassOf $c_2 \in S$    (1)

$t(s, p_1, o) \Rightarrow t(s, p_2, o)$, with $p_1$ rdfs:subPropertyOf $p_2 \in S$    (2)

$t(s, p, X) \Rightarrow t(s, \text{rdf:type}, c)$, with $p$ rdfs:domain $c \in S$    (3)

$t(X, p, o) \Rightarrow t(o, \text{rdf:type}, c)$, with $p$ rdfs:range $c \in S$    (4)

$t(s, \text{rdf:type}, c_i) \Rightarrow t(s, \text{rdf:type}, X)$, for any class $c_i$ of $S$    (5)

$t(s, p_i, o) \Rightarrow t(s, X, o)$, for any property $p_i$ of $S$ and rdf:type    (6)

Figure 2: Reformulation rules for an RDFS S. 

More precisely, Algorithm 1 uses the rules in Figure 2 to generate
new queries from the original one, by a backward application
of the rules on the query atoms. It then applies the same procedure
on the newly obtained queries until no new queries can be
constructed, and then outputs the union of the generated queries. The
inner loop (lines 5-16) comprises six if statements, one for each of
the six rules above. The conditions of these statements represent
the heads (right parts) of the rules, whereas the consequents
correspond to their bodies (left parts). In each iteration, when a query
atom matches the condition of an if, the respective rule is triggered,
replacing the atom with the one of the body of the rule. Note that
rules 5 and 6 need to bind a variable $X$ of an atom to a constant $c_i$,
$p_i$, or rdf:type, and thus use $\sigma$ to bind all the occurrences of $X$ in the
query, in order to retain the join on $X$ within the whole new query.
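
The fixpoint described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the tuple encoding of queries (a head tuple plus a tuple of triple atoms), the "?" prefix for variables, the way classes and properties are collected from the schema, and the position-based naming of existential variables (used so that re-applying rule 3 or 4 regenerates the same query and the fixpoint is reached) are all assumptions made for the example.

```python
RDF_TYPE = "rdf:type"


def is_var(term):
    # Assumed convention: variables are strings starting with "?".
    return term.startswith("?")


def substitute(query, var, const):
    """Replace every occurrence of `var` in the head and body by `const`."""
    head, body = query
    swap = lambda t: const if t == var else t
    return (tuple(map(swap, head)),
            tuple(tuple(map(swap, atom)) for atom in body))


def replace_atom(query, i, atom):
    """Replace the i-th body atom of the query by `atom`."""
    head, body = query
    return (head, body[:i] + (atom,) + body[i + 1:])


def reformulate(query, schema):
    """Backward-apply rules 1-6 until no new conjunctive query appears."""
    pairs = lambda pred: [(s, o) for s, p, o in schema if p == pred]
    sub_class, sub_prop = pairs("rdfs:subClassOf"), pairs("rdfs:subPropertyOf")
    domain, rng = pairs("rdfs:domain"), pairs("rdfs:range")
    # Assumption: classes and properties are those mentioned in the schema.
    classes = sorted({c for pair in sub_class for c in pair}
                     | {c for _, c in domain + rng})
    props = sorted({p for pair in sub_prop for p in pair}
                   | {p for p, _ in domain + rng})

    ucq, previous = {query}, set()
    while ucq != previous:
        previous = set(ucq)
        for q in previous:
            for i, (s, p, o) in enumerate(q[1]):
                fresh = f"?_e{i}"  # existential variable named after the atom position
                if p == RDF_TYPE and not is_var(o):
                    for c1, c2 in sub_class:          # rule 1
                        if c2 == o:
                            ucq.add(replace_atom(q, i, (s, RDF_TYPE, c1)))
                    for prop, c in domain:            # rule 3
                        if c == o:
                            ucq.add(replace_atom(q, i, (s, prop, fresh)))
                    for prop, c in rng:               # rule 4
                        if c == o:
                            ucq.add(replace_atom(q, i, (fresh, prop, s)))
                if not is_var(p):
                    for p1, p2 in sub_prop:           # rule 2
                        if p2 == p:
                            ucq.add(replace_atom(q, i, (s, p1, o)))
                if p == RDF_TYPE and is_var(o):       # rule 5
                    for c in classes:
                        ucq.add(substitute(q, o, c))
                if is_var(p):                         # rule 6
                    for prop in props + [RDF_TYPE]:
                        ucq.add(substitute(q, p, prop))
    return ucq


schema = {("painting", "rdfs:subClassOf", "picture"),
          ("isExpIn", "rdfs:subPropertyOf", "isLocatIn")}
q = (("?X1", "?X2"), (("?X1", "?X2", "picture"),))
union = reformulate(q, schema)
```

With the two-statement schema above, the single-atom query over a variable property unfolds into six conjunctive queries: the original one, three obtained by rule 6, and one more each from rules 2 and 1.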

Theorem 4.1 (Termination of Reformulate(q, S)).
Given a query $q$ over an RDFS $S$, Reformulate($q$, $S$) terminates
and outputs a union of no more than $(2|S|^2)^m$ queries, where $|S|$
is the number of statements in $S$ and $m$ the number of atoms in $q$.

The proof can be found in the technical report [25]. This theorem
also shows that the size of the reformulation is polynomial in the size
of the schema and exponential in the size of the query.

Theorem 4.2 (Correctness of Algorithm 1). Let $\mathit{ucq}$
be the output of Reformulate($q$, $S$), for a query $q$ over an RDFS
$S$. For any database $D$ associated to $S$:

evaluate($q$, saturate($D$, $S$)) = evaluate($\mathit{ucq}$, $D$).

Again, the proof is deferred to the technical report [25].

4.3 View selection aware of RDF entailment 

We now discuss possible ways to take RDF implicit triples into
account in our view selection approach. As will be explained, the
exact way cardinality statistics are collected for each view atom,
first described in Section 3.3, plays an important role here.

Database saturation If the database is saturated prior to view se- 
lection, the collected statistics do reflect the implicit triples. 

Pre-reformulation Alternatively, one could reformulate the query
workload and then apply our search on the new workload. To
do so, we extend the definition of our initial state, as well as our
rewriting language, to that of unions of conjunctive queries. More
precisely, given a set of queries $Q = \{q_1, \ldots, q_n\}$, and assuming
that Reformulate($q_i$, $S$) $= \{q_i^1, \ldots, q_i^{n_i}\}$, it is sufficient to
define $S_0(Q) = (V_0, R_0)$ as the set of conjunctive views
$V_0 = \bigcup_{i=1}^{n} \{q_i^1, \ldots, q_i^{n_i}\}$ and the set of rewritings
$R_0 = \bigcup_{i=1}^{n} \{q_i = q_i^1 \cup \cdots \cup q_i^{n_i}\}$. In this case, statistics are collected on
the original (non-saturated) database for the reformulated queries.

As stated in Theorem 4.1, query reformulation can yield a significant
number of new queries, increasing the number of views of our
initial state and leading to a serious increase of the search space. As
an example, the following simple query on the Barton [24] dataset

$q(X_1, X_2, X_3)$ :- $t(X_1, \text{rdf:type}, \text{text})$, $t(X_1, \text{relatedTo}, X_2)$,
                  $t(X_2, \text{rdf:type}, \text{subjectPart})$, $t(X_1, \text{language}, \text{fr})$,
                  $t(X_2, \text{description}, X_3)$

is reformulated with the Barton schema into a union of 104 queries.
Given the very high complexity of the exhaustive search problem
(Section 5.1), such an increase may significantly impact view
selection performance.

$q^{1,S}$ =   $q^1(X_1)$ :- $t(X_1, \text{rdf:type}, \text{picture})$                  (1)
        $\cup$ $q^1(X_1)$ :- $t(X_1, \text{rdf:type}, \text{painting})$                 (2)

$q^{4,S}$ =   $q^4(X_1, X_2)$ :- $t(X_1, X_2, \text{picture})$                    (1)
        $\cup$ $q^4(X_1, \text{isLocatIn})$ :- $t(X_1, \text{isLocatIn}, \text{picture})$     (2)
        $\cup$ $q^4(X_1, \text{isExpIn})$ :- $t(X_1, \text{isExpIn}, \text{picture})$         (3)
        $\cup$ $q^4(X_1, \text{rdf:type})$ :- $t(X_1, \text{rdf:type}, \text{picture})$       (4)
        $\cup$ $q^4(X_1, \text{isLocatIn})$ :- $t(X_1, \text{isExpIn}, \text{picture})$       (5)
        $\cup$ $q^4(X_1, \text{rdf:type})$ :- $t(X_1, \text{rdf:type}, \text{painting})$      (6)

Table 2: Term reformulation for post-reasoning.

Post-reformulation To avoid this explosion, we propose to apply 
reformulation not on the initial queries but directly on the views in 
the final (best) state recommended by the search. 

Directly doing so introduces a source of errors: since statistics
are collected directly on the original database, and the queries are
not reformulated, the implicit triples will not be taken into account
in the cost estimation function $c^\epsilon$. To overcome this problem, we
reflect implicit triples in the statistics, by reformulating each view
atom $v^i$ into a union of atoms Reformulate($v^i$, $S$) prior to the
view search, and then replacing $|v^i|$ (i.e., the cardinality of $v^i$) in
our cost formulas with |Reformulate($v^i$, $S$)|. This results in
having the same statistics as if the database was saturated. Then, we
perform the search using the (non-reformulated) queries and get
the same best state as in the database saturation approach (as we
use the same initial state and statistics). Since materializing the
best state's views directly would not include the implicit triples, we
need to reformulate these views first. Theorem 4.2 guarantees the
correctness of post-reformulation (materializing the reformulated
views on the non-saturated database is the same as materializing
the non-reformulated ones on the saturated database).
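
To make the statistics argument concrete, the following Python sketch compares, on a toy dataset, the cardinality of a view atom on the saturated database with the cardinality of its reformulation on the original database. It is a simplified illustration under stated assumptions: only rules 1 and 2 (subClassOf and subPropertyOf) are handled, the atom has a constant property, and the data values (a, b, c, louvre, paris) are hypothetical.

```python
RDF_TYPE = "rdf:type"


def saturate(triples, sub_class, sub_prop):
    """Add implicit triples entailed by subClassOf / subPropertyOf statements."""
    saturated = set(triples)
    while True:
        implied = {(s, RDF_TYPE, c2) for s, p, o in saturated if p == RDF_TYPE
                   for c1, c2 in sub_class if c1 == o}
        implied |= {(s, p2, o) for s, p, o in saturated
                    for p1, p2 in sub_prop if p1 == p}
        if implied <= saturated:
            return saturated
        saturated |= implied


def reformulate_atom(atom, sub_class, sub_prop):
    """Union of atoms obtained by backward application of rules 1 and 2."""
    seen, todo = {atom}, [atom]
    while todo:
        s, p, o = todo.pop()
        candidates = [(s, p1, o) for p1, p2 in sub_prop if p2 == p]
        if p == RDF_TYPE:
            candidates += [(s, p, c1) for c1, c2 in sub_class if c2 == o]
        for candidate in candidates:
            if candidate not in seen:
                seen.add(candidate)
                todo.append(candidate)
    return seen


def answers(atoms, triples):
    """Distinct variable bindings matching at least one atom of the union."""
    found = set()
    for atom in atoms:
        for triple in triples:
            binding = {}
            if all(binding.setdefault(a, t) == t if a.startswith("?") else a == t
                   for a, t in zip(atom, triple)):
                found.add(tuple(sorted(binding.items())))
    return found


sub_class = [("painting", "picture")]
sub_prop = [("isExpIn", "isLocatIn")]
db = {("a", RDF_TYPE, "painting"), ("b", RDF_TYPE, "picture"),
      ("a", "isExpIn", "louvre"), ("c", "isLocatIn", "paris")}

atom = ("?X1", RDF_TYPE, "picture")
plain = len(answers({atom}, db))
on_saturated = len(answers({atom}, saturate(db, sub_class, sub_prop)))
reformulated = len(answers(reformulate_atom(atom, sub_class, sub_prop), db))
```

On this toy data, counting the atom directly on the original database misses the implicit triple, whereas the reformulated atom yields the same count as the saturated database, which is the property the cost estimation relies on.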

Consider the query $q$ of Section 3.3, with the following schema:

$S$ = {painting rdfs:subClassOf picture,
     isExpIn rdfs:subPropertyOf isLocatIn}

We first count (see Section 3.3) the exact number of triples matching
the query atoms and their relaxed versions, namely $q^1$ to $q^5$.

We now reformulate each $q^i$ based on $S$ into a union of queries,
denoted $q^{i,S}$. Table 2 illustrates this for $q^1$ and $q^4$ (for space
reasons, we omit the other similar terms). Rule 1 (Figure 2) has been
applied on $q^1$, adding to it a second union term. Applying rule 6 on
$q^4$ leads to replacing $X_2$ with isLocatIn, isExpIn, and rdf:type
respectively in the second, third and fourth union terms of $q^{4,S}$. In
turn, the second term triggers rule 2, producing a fifth term, while
the fourth term triggers rule 1 to produce the sixth union term.

The cardinality of each reformulated atom $q^{i,S}$ is estimated prior
to the search. Then, we perform the search for the non-reformulated
version of $q$ using these statistics, and get the following best state:

$v_1(X_1, X_2)$ :- $t(X_1, \text{rdf:type}, X_2)$,   $v_2(X_1, X_2)$ :- $t(X_1, \text{isLocatIn}, X_2)$

$r = \pi_{v_1.X_1, v_2.X_2}(\sigma_{X_2 = \text{picture}}(v_1) \bowtie_{v_1.X_1 = v_2.X_1} v_2)$

After the search has finished, instead of the recommended views
$v_1$ and $v_2$, we materialize their reformulated variants $v_1'$ and $v_2'$:

$v_1'(X_1, X_2)$ :- $t(X_1, \text{rdf:type}, X_2)$
  $\cup$ $v_1'(X_1, \text{painting})$ :- $t(X_1, \text{rdf:type}, \text{painting})$
  $\cup$ $v_1'(X_1, \text{picture})$ :- $t(X_1, \text{rdf:type}, \text{picture})$
  $\cup$ $v_1'(X_1, \text{picture})$ :- $t(X_1, \text{rdf:type}, \text{painting})$

$v_2'(X_1, X_2)$ :- $t(X_1, \text{isLocatIn}, X_2)$
  $\cup$ $v_2'(X_1, X_2)$ :- $t(X_1, \text{isExpIn}, X_2)$

Executing $r$ on $v_1'$ and $v_2'$ provides the complete answers for $q$.
In post-reformulation, finding the best state does not require sat- 
urating the database nor multiplying the queries and making the 
[Figure 3: state graph over the states $S_0$ to $S_8$, with edges labeled by Sc, Jc, Vb and Vf transitions; the drawing is not recoverable from the extracted text. The view sets of the states are:]

$V_0$ {$q(Y, Z)$ :- $t(X, Y, c_1)$, $t(X, Z, c_2)$}

$V_1$ {$q_1(X_1, Y)$ :- $t(X_1, Y, c_1)$; $q_2(X_2, Z)$ :- $t(X_2, Z, c_2)$}

$V_2$ {$q(Y, Z, W_1)$ :- $t(X, Y, W_1)$, $t(X, Z, c_2)$}

$V_3$ {$q(Y, Z, W_2)$ :- $t(X, Y, c_1)$, $t(X, Z, W_2)$}

$V_4$ {$q(Y, Z, W_1, W_2)$ :- $t(X, Y, W_1)$, $t(X, Z, W_2)$}

$V_5$ {$q_1(X_1, Z, W_1)$ :- $t(X_1, Z, W_1)$; $q_2(X_2, Z)$ :- $t(X_2, Z, c_2)$}

$V_6$ {$q_1(X_1, Z)$ :- $t(X_1, Z, c_1)$; $q_2(X_2, Z, W_2)$ :- $t(X_2, Z, W_2)$}

$V_7$ {$q_1(X_1, Z, W_1)$ :- $t(X_1, Z, W_1)$; $q_2(X_2, Z, W_2)$ :- $t(X_2, Z, W_2)$}

$V_8$ {$q(X, Y, Z)$ :- $t(X, Y, Z)$}

Figure 3: Sample exhaustive strategy (solid arrows), ExNaive
strategy (solid and dashed arrows), and view sets corresponding
to each state.

search space size explode (as pre-reformulation does). Thus, this is 
the best approach for situations where database saturation is not an 
option, which is also shown through our experiments in Section 6.5. 

5. SEARCHING FOR VIEW SETS 

This Section discusses strategies for navigating in the space of 
candidate view sets (or states), looking for a low- or minimal-cost 
state. We discuss the exhaustive search strategies and identify an 
interesting subset of stratified strategies in Section 5.1, based on 
which we analyze the size of the search space. In Section 5.2, we 
present several efficient optimizations and search heuristics. 

5.1 Exhaustive Search Strategies 

We define the initial state of the search as $S_0(Q) = (V_0, R_0)$,
such that $V_0 = Q$, i.e., the set of views is exactly the set of queries,
and each rewriting in $R_0$ is a view scan. The state graph $G(S_0)$
corresponds to the queries in $Q$. Clearly, the rewriting cost of $S_0$ is
low, since each query rewriting is simply a view scan. However, its
space consumption and/or view maintenance costs may be high.

We denote by $S \xrightarrow{\tau} S'$ the application of the transition
$\tau \in \{\text{Sc}, \text{Jc}, \text{Vb}, \text{Vf}\}$ on a state $S$, leading to the state $S'$.

DEFINITION 5.1 (PATH). A path is a sequence of transitions
of the form: $S_0 \xrightarrow{\tau_1} S_1, S_1 \xrightarrow{\tau_2} S_2, \ldots, S_{k-1} \xrightarrow{\tau_k} S_k$.

For instance, in Figure 3, $(S_0 \xrightarrow{\text{Sc}(c_2)} S_3), (S_3 \xrightarrow{\text{Jc}} S_6)$ is a path.
We may denote a path simply by its transitions, e.g., (Sc($c_2$), Jc).

Theorem 5.1 (Completeness of the transition set).
Given a workload $Q$ and an initial state $S_0$, for every possible state
$S(Q)$, there exists a path from the initial state $S_0$ to $S$.

The proof is given in our technical report [25].

DEFINITION 5.2 (STRATEGY). A search strategy $\Sigma$ is a sequence
of transitions of the form:

$\Sigma = (S_{i_1} \xrightarrow{\tau_{i_1}} S'_{i_1}), (S_{i_2} \xrightarrow{\tau_{i_2}} S'_{i_2}), \ldots, (S_{i_k} \xrightarrow{\tau_{i_k}} S'_{i_k})$,

where $S_{i_1} = S_0$, for every $j \in [1..k]$, $\tau_{i_j} \in \{\text{Sc}, \text{Jc}, \text{Vb}, \text{Vf}\}$, and
for every $j \in [2..k]$ there exists $l < j$ such that $S'_{i_l} = S_{i_j}$ (each
state but $S_0$ must be attained before it is transformed).

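For example, for the one-query workload depicted at the top left
of Figure 3, one possible strategy is:

$\Sigma_1 = (S_0 \xrightarrow{\text{Sc}(c_1)} S_2), (S_2 \xrightarrow{\text{Sc}(c_2)} S_4), (S_0 \xrightarrow{\text{Sc}(c_2)} S_3),$
      $(S_3 \xrightarrow{\text{Sc}(c_1)} S_4), (S_0 \xrightarrow{\text{Jc}} S_1)$

Algorithm 2: ExNaive($S_0$)

Input : an initial state $S_0$
Output: the best state $S_b$ found

1 $S_b \leftarrow S_0$, $S_{new} \leftarrow$ null, $CS \leftarrow \{S_0\}$, $ES \leftarrow \emptyset$, $NS \leftarrow \emptyset$
2 while $CS \neq \emptyset$ do
3     foreach state $S_c \in CS$ do
4         $S_{new} \leftarrow$ applyTrans({Sc, Jc, Vb, Vf}, $S_c$, $(ES \cup CS)$)
5         if $S_{new}$ = null then move $S_c$ from $CS$ to $ES$
6         else
7             $CS \leftarrow CS \cup \{S_{new}\}$
8             if $c^\epsilon(S_{new}) < c^\epsilon(S_b)$ then $S_b \leftarrow S_{new}$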

A strategy $\Sigma$ is exhaustive if any state $S$ that can be reached
through a path is also reached in $\Sigma$ (not necessarily through the
same path). For instance, in Figure 3, the solid arrows depict an
exhaustive strategy, reaching all possible states.

We first consider a simple family of strategies called ExNAIVE
and described through Algorithm 2. An ExNAIVE strategy (as all
strategies presented in this work) maintains a candidate state set
$CS$ and a set of explored states $ES$. $CS$ keeps the states on which
more transitions can possibly be applied and is initially $\{S_0\}$. $ES$
is disjoint from $CS$ and is empty in the beginning. A state $S$ is
explored when any state $S' = \tau(S)$ obtained by applying some
transition $\tau \in \{\text{Sc}, \text{Jc}, \text{Vb}, \text{Vf}\}$ to $S$ already belongs either to
$CS$ or to $ES$. ExNAIVE at each point picks a state $S_c$ from $CS$
and tries to apply a transition to it (applyTrans, line 4). If no
new state is obtained, $S_c$ was already explored and is moved to $ES$
(line 5); otherwise, the newly obtained state ($S_{new}$) is added to
$CS$ (line 7). During the search, we also keep the best state found
so far (denoted $S_b$), i.e., the one having the lowest cost $c^\epsilon(S)$ (line 8). The
strategy stops when no new states can be found. Clearly, ExNAIVE
strategies are exhaustive. In Figure 3, the solid and dashed arrows,
together, illustrate an ExNAIVE strategy.
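
The bookkeeping of candidate and explored states can be sketched in Python as follows. This is a generic illustration rather than the paper's implementation: states are assumed to be hashable, the `successors` function stands in for applyTrans over all transition kinds, and the toy state space (sets of remaining constants, where each transition drops one constant, in the spirit of selection cuts) and its cost table are hypothetical.

```python
def ex_naive(initial, successors, cost):
    """Exhaustive search keeping a candidate list (CS) and an explored set (ES)."""
    best = initial
    candidates, explored = [initial], set()
    while candidates:
        for state in list(candidates):
            known = explored.union(candidates)
            # Try to obtain one state that is neither in CS nor in ES.
            new = next((s for s in successors(state) if s not in known), None)
            if new is None:
                # No new state: `state` is fully explored.
                candidates.remove(state)
                explored.add(state)
            else:
                candidates.append(new)
                if cost(new) < cost(best):
                    best = new
    return best, explored


def drop_one_constant(state):
    # Hypothetical transition: replace one constant by a variable.
    return [state - {c} for c in sorted(state)]


# Hypothetical estimated costs for the four reachable states.
toy_cost = {
    frozenset({"c1", "c2"}): 10,
    frozenset({"c1"}): 6,
    frozenset({"c2"}): 7,
    frozenset(): 9,
}

best, explored = ex_naive(frozenset({"c1", "c2"}), drop_one_constant,
                          toy_cost.__getitem__)
```

Because every newly created state is checked against both sets, a state reachable through several paths (here the empty set of constants) is created only once, although the transitions leading to it are still attempted.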

For a given strategy $\Sigma$, the paths to a state $S \in \Sigma$, denoted ${\leadsto}S$, is
the set of all $\Sigma$ paths whose final state is $S$. In an ExNAIVE strategy
there may be multiple paths to some states, e.g., Se, is reached twice 
in our example, which slows down the search. We define the notion 
of stratification to reduce the number of such duplicate states. 

Definition 5.3 (Stratified path). A path $p \in {\leadsto}S$ for
some state $S \in \Sigma$ is stratified iff it belongs to the regular language:
Vb* Sc* Jc* Vf*.

A stratified path constrains the order among the types of tran- 
sitions on the path: all possible view breaks appear only in the 
beginning of the path and are followed by the selection cuts. Join 
cuts appear only after all selection cuts are applied and are in turn 
followed by zero or more view fusions. In Figure 3, all solid-arrow 
paths starting from So are stratified. 
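
Membership in this regular language is easy to test mechanically. The short Python sketch below does so, assuming (for illustration only) that a path is given as a list of transition kinds written "Vb", "Sc", "Jc" and "Vf", with transition arguments such as the cut constant omitted.

```python
import re

# Regular language of stratified paths: Vb* Sc* Jc* Vf*
STRATIFIED = re.compile(r"(Vb)*(Sc)*(Jc)*(Vf)*")


def is_stratified(path):
    """path: sequence of transition kinds, e.g. ["Sc", "Sc", "Jc"]."""
    return STRATIFIED.fullmatch("".join(path)) is not None
```

For example, the path (Sc, Jc) is accepted, whereas (Jc, Sc) is rejected because a selection cut follows a join cut.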

The following theorem formalizes the interest of stratified paths. 

Theorem 5.2 (Completeness of stratified paths). 
Let Q be a query workload and S(Q) be a state for Q. There 
exists a stratified path leading from the initial state So to S. 

The proof can be found in our technical report [25]. 
We can now identify an interesting family of strategies. 

Definition 5.4 (Stratified strategy). A strategy $\Sigma$ is
stratified iff for any $S \in \Sigma$ and $p \in {\leadsto}S$, $p$ is stratified.

In Figure 3, any topological sort of the solid edges is a stratified 
strategy, more efficient than the ExNAIVE one illustrated in the 
Figure, since the latter performs four extra transitions. Observe 
that a stratified strategy does not constrain the order of transitions
that are not on the same path. For instance, in Figure 3, a stratified
strategy may apply the transition $S_0 \xrightarrow{\text{Jc}} S_1$ before all the Scs.

We now define the important family of ExSTR strategies. Start- 
ing from the initial state So, an ExSTR strategy picks any state on 
which it applies any applicable transition, preserving the stratifica- 
tion of all strategy paths. Several ExSTR strategies may exist for 
a workload, differing in their ordering of the transitions. We will 
simply use ExSTR to refer to any of them. The ExNAIVE strategy 
(Algorithm 2) can be turned to an ExSTR one through the follow- 
ing modification: when applyTrans (line 4) is called on a state 
S c , it should apply the transitions in a stratified way, i.e., first it at- 
tempts a Vb and only if no new state is obtained, it applies an Sc, 
and then a Jc and, finally, a Vf. 

Theorem 5.3 (Interest of ExStr). (i) Any ExSTR strategy
is exhaustive. (ii) For a given workload $Q$, and arbitrary
ExSTR strategy $\Sigma_S$ and ExNAIVE strategy $\Sigma_N$, $\Sigma_S$ has at most
the number of transitions of $\Sigma_N$.

The proof is given in [25]. Due to Theorem 5.3, among the 
exhaustive strategies, we will only consider wlog the stratified ones. 

Size of the search space Let the workload $Q$ have in total $n$
nodes in its initial state $S_0$. Denoting by $B_k$ the $k$-th Bell number
(the number of partitions of a set of size $k$), and by $p(n, k)$
the number of minimal covers with $k$ members of a set of size
$n$, for the number of candidate view sets we have NS(Q, n) <

$\sum_{k=1}^{n} 2^{kn^2}\, p(n, k)\, B_k$. The details are provided in [25].
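
To give a feeling for how quickly one ingredient of this bound grows, the following Python sketch computes the Bell numbers with the standard Bell triangle. It only illustrates the $B_k$ factor; the function name is chosen for the example, and the full bound (including the minimal-cover counts $p(n, k)$) is not evaluated here.

```python
def bell_numbers(n):
    """Return [B_0, ..., B_n], computed with the Bell triangle."""
    row, bells = [1], [1]
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells
```

Already for a set of size 10 there are 115975 partitions, which hints at why exhaustive enumeration of view sets becomes impractical beyond small workloads.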
Time complexity The time complexity of exhaustive search can be
derived from the number of states created by each transition and the
time complexity of the transition. The cost of an Sc, Jc, or Vb is
linear in the size of the largest view, which is bounded by $3n$, whereas
Vf requires checking query equivalence, which is in $O(2^n)$ [7].

The complexity of exhaustive search is very high and, even if 
views are selected off-line and thus time is not a concern, it brings 
real issues due to memory limitations. This highlights the need for 
robust strategies with low memory needs, and efficient heuristics. 

5.2 Optimizations and heuristics 

We now discuss a set of search strategies with interesting prop- 
erties, as well as a set of pruning heuristics which may be used to 
trade off completeness for efficiency of the search. 

Depth-first search strategies (Dfs) A (stratified) strategy $\Sigma$ is
depth-first iff the order of $\Sigma$'s transitions satisfies the following
constraint. Let $S$ be a state reached by a path $p$ of the form Vb*.
Immediately after $S$ is reached, $\Sigma$ enumerates all states recursively
attainable from $S$ by Sc only. This process is then repeated with
Jc and then with Vf. The pseudocode of Dfs can be obtained by
replacing lines 3-4 of Algorithm 2 with the following ones, where
recApplyTrans returns all states that can be reached by a specific
transition starting from a given state:

foreach state $S_{Vb} \in$ {recApplyTrans(Vb, $S_0$)} do
    foreach state $S_{Sc} \in$ {recApplyTrans(Sc, $S_{Vb}$)} do
        foreach state $S_{Jc} \in$ {recApplyTrans(Jc, $S_{Sc}$)} do
            foreach state $S_{Vf} \in$ {recApplyTrans(Vf, $S_{Jc}$)} do
                ...



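For instance, in Figure 3, the following strategy $\Sigma_3$ is DFS:

$\Sigma_3 = (S_0 \xrightarrow{\text{Sc}(c_1)} S_2), (S_2 \xrightarrow{\text{Sc}(c_2)} S_4), (S_4 \xrightarrow{\text{Jc}} S_7),$
      $(S_7 \xrightarrow{\text{Vf}} S_8), (S_0 \xrightarrow{\text{Sc}(c_2)} S_3), (S_3 \xrightarrow{\text{Jc}} S_6)$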

An advantage of DFS strategies is that they fully explore each
obtained state more quickly, reducing the number of states stored in
$CS$. This results in a significant reduction of the maximum memory
needs during the search compared, e.g., with ExNAIVE, which
develops a huge number of candidates before fully exploring them.

Aggressive view fusion (Avf) This technique can be included in
any strategy and is based on the fact that Vf can only decrease the
overall cost of a state (Section 3.3). Once a new state $S$ is obtained
through some Sc, Jc or Vb, we recursively apply on $S$ all possible
Vfs (until no more views can be fused). It can be shown that
such repeated Vfs converge to a single state $S_{VF}$. We then discard
all intermediate states leading from $S$ to $S_{VF}$ and add only $S_{VF}$ to
$CS$. Thus, AVF preserves the optimality of the search, all the while
eliminating many intermediary states whose estimated cost is
guaranteed to be higher than that of $S_{VF}$. For example, assume we reach
a state $S$ containing three identical views. We apply a Vf on $S$,
fusing two of the three views, and obtain the state $S'$. We then apply a
Vf on $S'$, fusing the two remaining identical views, and obtain $S_{VF}$.
AVF discards $S'$ and keeps only $S_{VF}$ to continue the search.
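
The three-identical-views example can be reproduced with the following Python sketch, which jumps directly to the fully fused view set. It rests on simplifying assumptions that are not the paper's method: views are encoded as (head, body) tuples with "?"-prefixed variables, and two views are considered fusible only when they are syntactically equal up to variable renaming with the same atom order, which is weaker than the query equivalence test that Vf requires; updating the rewritings after fusion is not shown.

```python
def canonical(view):
    """Rename variables by order of first occurrence (head first, then body)."""
    head, body = view
    names = {}

    def rename(term):
        if term.startswith("?"):
            return names.setdefault(term, f"?v{len(names)}")
        return term

    new_head = tuple(rename(t) for t in head)
    new_body = tuple(tuple(rename(t) for t in atom) for atom in body)
    return (new_head, new_body)


def aggressive_view_fusion(views):
    """Keep one representative per group of identical views."""
    fused, seen = [], set()
    for view in views:
        key = canonical(view)
        if key not in seen:
            seen.add(key)
            fused.append(view)
    return fused


state = [
    (("?X", "?Y"), (("?X", "rdf:type", "?Y"),)),
    (("?A", "?B"), (("?A", "rdf:type", "?B"),)),
    (("?U", "?W"), (("?U", "rdf:type", "?W"),)),
    (("?X", "?Y"), (("?X", "isLocatIn", "?Y"),)),
]
fused_state = aggressive_view_fusion(state)
```

The intermediate state with two of the three identical views fused is never materialized, which mirrors how AVF keeps only the final fused state in the candidate set.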

Greedy stratified (GSTR) This strategy starts by applying all possible
Vb transition sequences on $S_0$. It then discards all the
obtained states but $S_b$, and repeatedly applies on it all possible Sc.
Keeping only $S_b$, it proceeds in the same way by applying Jc and
then Vf. The interest of GSTR lies in the possibility to combine it
with the AVF technique, leading to the GSTR-AVF strategy.
GSTR-AVF has low memory needs due to the many states dropped by
GSTR and AVF, and moves fast towards lower-cost states due to
AVF. Although neither GSTR nor GSTR-AVF can guarantee
optimality, they perform well in practice, as our experiments show.

Stop conditions We use some stop conditions to limit the search 
by considering that some states are not promising and should not 
be explored. Clearly, stop conditions lead to non-exhaustive search. 
We have considered the following stop conditions for a state S. 

• $stop_{tt}(S)$: true if a view in $S$ is the full triple table $t$.

• $stop_{var}(S)$: true if a view in $S$ has only variables. The idea
is that we reject $S$ since we consider its space occupancy to
be too high. This condition is not applicable if it is satisfied
by the initial state, but such queries are of limited interest.

• $stop_{time}(S)$: true if the search has lasted more than a given
amount of time. Observe that our approach is guaranteed to
have some recommended $S_b$ state at any time.
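
As a rough illustration, the three stop conditions could be written as the Python predicates below. The encoding is an assumption made for the example: a state is a list of views, each view a (head, body) pair of tuples with "?"-prefixed variables; in particular, treating "the full triple table" as a single atom with three distinct variables that are all exported in the head is one possible reading, and the time budget is passed explicitly.

```python
import time


def is_var(term):
    return term.startswith("?")


def stop_tt(state):
    """True if some view is a single atom of three distinct, exported variables."""
    return any(
        len(body) == 1
        and all(is_var(t) for t in body[0])
        and len(set(body[0])) == 3
        and set(body[0]) <= set(head)
        for head, body in state
    )


def stop_var(state):
    """True if some view contains no constant at all."""
    return any(
        all(is_var(t) for atom in body for t in atom)
        for head, body in state
    )


def stop_time(started_at, budget_seconds):
    """True once the search has lasted longer than the given budget."""
    return time.monotonic() - started_at > budget_seconds


full_table = (("?X", "?Y", "?Z"), (("?X", "?Y", "?Z"),))
selective = (("?X",), (("?X", "rdf:type", "picture"),))
```

A search loop would typically skip (rather than expand) any newly created state for which one of these predicates holds.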

6. EXPERIMENTAL EVALUATION 

This Section presents an experimental evaluation of our approach, 
which we have fully implemented as a Java 6 application. The ap- 
plication takes as input a set of conjunctive RDF queries and pos- 
sibly an RDF schema, and produces as output the set of recom- 
mended views and query rewritings. It uses a database back-end to 
store both the original RDF data and schema, and the views. 
Platform and data layout We used PostgreSQL (version 8.4.3) as 
the database back-end for its reputation as a (free) efficient plat- 
form that has been used in several related works [1, 15, 16, 20, 23]. 
Integrating our view selection approach with another platform is 
easy as soon as that platform supports the evaluation of our select- 
project-join rewritings, and provided that the cost function is appro- 
priately customized to account for the respective evaluation engine. 

As in many previous works, for efficiency, we stored the data in 
a dictionary-encoded triple table, using a distinct integer for each 
distinct URI or literal appearing in an s, p or o value. The encoding 
dictionary was stored as a separate table indexed both by the integer 
dictionary code and by the encoded constant. The triple table was 
clustered by the columns p and then s, to enhance the efficiency of 
(frequent) queries where the p values are specified in most or all 
atoms. Moreover, we indexed the encoded triple table on s, p, o,
and all two- and three-column combinations.
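
The data layout described above can be mimicked in a few lines of Python. This is only a sketch of the idea of dictionary encoding, not of the PostgreSQL schema actually used: the function name is hypothetical, codes are assigned in order of first appearance, and sorting the encoded triples by (p, s) merely imitates the clustering of the triple table.

```python
def encode(triples):
    """Dictionary-encode triples; return (table, code_of, constant_of)."""
    code_of, constant_of = {}, []

    def code(constant):
        # Assign a distinct integer to each distinct URI or literal.
        if constant not in code_of:
            code_of[constant] = len(constant_of)
            constant_of.append(constant)
        return code_of[constant]

    encoded = [tuple(code(x) for x in triple) for triple in triples]
    # Imitate clustering by property, then subject.
    table = sorted(encoded, key=lambda t: (t[1], t[0]))
    return table, code_of, constant_of


triples = [
    ("a", "rdf:type", "painting"),
    ("a", "isExpIn", "louvre"),
    ("b", "rdf:type", "picture"),
]
table, code_of, constant_of = encode(triples)
decoded = {tuple(constant_of[c] for c in row) for row in table}
```

The two mappings play the role of the dictionary table indexed both by code and by constant: `code_of` translates query constants into integers, and `constant_of` decodes result rows.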
Data and queries As in previous works [1, 15, 23], we used the 
Barton RDF dataset and RDFS [24]. The initial dataset consists of 
about 50 million triples. After some cleaning (removing format- 
ting errors, eliminating duplicates etc.) we kept about 35 million 
distinct triples. The space occupied by the encoded triple table, the 
dictionary and the indexes within PostgreSQL was 39 GB. 

The Barton query workload [24] contains few queries with no 
commonality among them. To better test our approach, we built 
two query generators, producing queries of controllable size, shape, 
and commonality. The first one simply outputs the desired queries, 
and has maximum flexibility. The second takes as input not only 
the workload characteristics, but also a dataset (RDF + RDFS) and 
generates queries having non-empty answers on the given dataset. 
We used it to obtain interesting workloads on the Barton dataset. 
Weights of cost components For VSO and REC (Section 3.3),
we used $c_s = 1$ and $c_r = 1$. For each workload, we set the value of
$c_m$ taking into account the database size and the average number
of atoms in each query, so that for the initial state $S_0$, $c_m \cdot VMC$
is within at most two orders of magnitude from the other two cost
components, $c_s \cdot VSO$ and $c_r \cdot REC$. In most cases, this led to
$c_m = 0.5$. Finally, we set $f = 2$ in VMC, since this value gave the
most appropriate range to VMC through the search.
Hardware and memory The PostgreSQL server ran on a separate 
2.13 GHz Intel Xeon machine with 8GB RAM. We ran search al- 
gorithms on two classes of hardware: a desktop 8-core Intel Xeon 
2.13 GHz machine with 16 GB RAM (the JVM was given 4 GB), 
and several cluster machines, each of which is a 4-core Intel Xeon 
2.33 GHz with 4 GB RAM (the JVM was given 3 GB). Each exper- 
iment ran on one machine. While there are opportunities for paral- 
lelization (see Section 8), we did not exploit them in this work. All 
machines were running Mandriva Linux 2.6.31. 

6.1 Competitor search strategies 

We have implemented the three strategies, Pruning, Greedy and 
Heuristic, introduced in the relational view selection work which 
inspired our states and transitions [21]. All these strategies follow 
a divide-and-conquer approach. They start by breaking down the 
initial state into a set of 1-query states, and apply all possible edge 
removals, then all possible view breaks on each such state. Then, 
they seek to put back together states corresponding to the complete 
workload by adding up and, when appropriate, fusing, one state for 
each workload query. Since any combination of partial states leads 
to a valid state in [21], the number of states thus created explodes. 
To avoid it, Pruning discards partial states outgrowing the given
space or cost budget, whereas Greedy develops very few states: it
only keeps the best combined state, say, for the workload queries
$\{q_1, q_2\}$, even though this may prevent finding the best combined
state for $\{q_1, q_2, q_3\}$. Finally, Heuristic resembles Pruning,
except that after having built all one-query states, it only keeps: the
minimal-cost state for each query, and any states which offer some 
view fusing opportunity. Since our algorithms do not use a cost or 
space budget, we did not give one to the [21] strategies either. This 
does not prevent their pruning which is mostly based on comparing 
two states and discarding the less interesting one. 

Search strategy acronyms In the sequel, for convenience, we will
refer to the [21] strategies simply as Pruning, Greedy and Heuristic.
Among the strategies we propose (see Section 5.2), DFS is the
(stratified) depth-first search, while GSTR is the greedy strategy.
The suffix -AVF after a strategy name denotes that aggressive view
fusion is applied in conjunction with that strategy, while the -STV
suffix denotes that the $stop_{var}$ stop condition is used.
Figure 4: Strategy comparison on small workloads.

Relative cost reduction To assess search effectiveness, we define
the relative cost reduction (rcr) of a given strategy $\Sigma$ and workload
$Q$, at a given moment, as the ratio $(c^\epsilon(S_0) - c^\epsilon(S_b)) / c^\epsilon(S_0)$, that
is, the fraction of the cost of the initial state $S_0$ avoided by the
current best state found by $\Sigma$ by that moment during the search.
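
As a small worked example of this ratio, the Python helper below computes the rcr from two estimated costs. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def relative_cost_reduction(initial_cost, best_cost):
    """Fraction of the initial state's cost avoided by the best state found."""
    if initial_cost <= 0:
        raise ValueError("the initial cost must be positive")
    return (initial_cost - best_cost) / initial_cost


rcr = relative_cost_reduction(200.0, 2.0)
```

With an initial cost of 200 and a best-state cost of 2, the rcr is 0.99; an rcr of 0 means the search found nothing better than the initial state.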

6.2 Comparison with existing strategies 

We compare our strategies with those of [21] for two small work- 
loads of 5 queries each. While the queries they tested involve on 
average 4 relations, one needs more RDF atoms than relations to 
express the same logical query, since data that would fit in a wide 
relational tuple is split over many RDF triples. Thus, queries in the 
first and second workload have 5 and 10 atoms each, respectively. 

Figure 4 shows the rcr of the three strategies of [21] and our 
strategies Dfs-Avf-Stv and GStr-Avf-Stv. The reasons for 
using the specific heuristics on our strategies are explained in Sec- 
tion 6.3. The Figure considers workloads of star and chain queries, 
which are typical in RDF. In particular, star queries translate to 
query graphs (Definition 3.1) that are cliques (each atom is con- 
nected to all others), allowing for many Vbs and Jcs and, therefore, 
have a search space of increased size, whereas chain queries can be 
considered an average case regarding the difficulty of the search. 
The workloads were generated both with high and low commonality
across queries and we used the $stop_{time}$ stop condition set to 30
minutes. While this may seem long, recall that the complexity of
search is high (Section 5.1). We consider this duration acceptable 
as view selection is an off-line process. The overhead is worth it 
especially for large workloads, and/or queries asked repeatedly. 

As can be seen in Figure 4, for the smaller workload, all strategies
performed well, with DFS-AVF-STV and GSTR-AVF-STV being the
best. The runs did not finish within the time limit, i.e., the strategies
might have found better solutions by searching longer. Greedy managed
to reduce the cost significantly for chains but failed to find any state
better than the initial one for star queries. For the larger workload,
the [21] strategies failed to produce any solution, as they outgrew the
available memory while building partial states (for 1, 2, 3 queries etc.)
before building any state covering all 5 queries. In contrast,
DFS-AVF-STV and GSTR-AVF-STV kept running and achieved interesting
cost reductions. The same trend was observed on workloads
with cycle- and random graph-shaped queries (we generated both
sparse and dense graphs), at high and low commonality.

Thus, from now on, and in particular for large workloads, we 
focus only on our strategies, since those of [21] systematically out- 
grow the memory before reaching a full candidate view set. 

6.3 Impact of heuristics and optimizations 

We now study the impact of the AVF and STV techniques on 
the search space explored by our algorithms. A tiny workload of 2 
queries of 4 atoms each (satisfiable on the Barton dataset) suffices 
to illustrate this (Figure 5). The queries used are star-shaped with 
low commonality; [25] shows similar results for other workloads. 
Figure 5: Impact of heuristics on the search.

Figure 6: Relative cost reduction for large workloads.

We used the DFS strategy and several combinations of heuristics.
The states created are those reached by the search, while duplicates
are those already attained through a different search path (they already
belong to $CS$ or $ES$; see Section 5) and are ignored. Discarded are
the states excluded from the search once they are created, whereas
explored are the ones from which all outgoing transitions respecting
DFS have been explored. In this experiment, which ran in the cluster,
all strategies completed execution and reached the same best state.

A first remark based on Figure 5 is that the number of duplicate
states may be quite important. Duplicates occur because even
when using a stratified strategy, a state may be reached by more
than one path. For instance, assume for some given views $v_1$, $v_2$
that an Sc modifies $v_1$ into $v_1'$ (denoted $v_1 \xrightarrow{\text{Sc}(c_1)} v_1'$) and similarly
$v_2 \xrightarrow{\text{Sc}(c_2)} v_2'$. From the state $(v_1, v_2)$, our algorithms reach
the state $(v_1', v_2')$ twice: once through $(v_1, v_2')$ and a second time
through $(v_1', v_2)$. Our algorithm identifies such states as soon as
they are created, in order not to repeat their exploration.

Second, Figure 5 shows that AVF (which fuses views within one 
candidate set as soon as possible) reduces the number of created 
states (while preserving optimality as explained in Section 5.2), 
because no state containing identical views is explored. A third 
remark is that STV discards a significant number of states, which 
significantly trims down all state counts. The combination AVF-STV 
is marginally better than STV alone and was efficient in all the 
experiments we ran. Hence, we systematically use it in the sequel. 

6.4 Cost reduction on large workloads 

We study the scalability of our DFS and GSTR algorithms for 
large query workloads. To this purpose, we generated workloads 
of 5, 10, 20, 50, 100 and 200 queries; each query has 10 atoms, 
i.e., the views of the initial states contain 10 atoms on average. We 
consider workloads consisting of: star queries only; chain queries 
only; random-graph shaped queries (with two variants, dense graph 
and sparse graph); mixed, combining queries of all the previous 
shapes. For each kind of workload, we generate three low- and 
three high-commonality variants. On each of these 30 workloads, 
we ran DFS-AVF-STV and GSTR-AVF-STV. We used the stop_time 
stop condition set to 3 hours. These experiments ran on the cluster. 

Table 3: Workloads used for reformulation experiments. 
Figure 7: Search for view sets using reformulation. 

Figure 6 plots, for each of the 10 workload types, the rcr averaged 
over the 3 workloads of that type, at the end of the search. 
A first remark is that the relative cost reduction of DFS is very 
high overall, in many cases around 0.99. Second, note that the 
rcr of GSTR-AVF-STV is generally smaller than that of DFS-AVF-STV, 
because GSTR explores significantly fewer states than DFS 
and might miss interesting opportunities. Third, we can distinguish 
"easier" workloads, such as chains and random-sparse graphs, 
resulting in query graphs with fewer edges and, thus, fewer 
transitions. For such workloads, the rcr is higher since the search 
space is smaller (and a bigger part of it was explored). Stars and 
random-dense graphs are difficult cases, as they lead to many edges, 
thus smaller rcrs. Finally, the rcrs obtained for high-commonality 
workloads are generally higher than for low-commonality ones, e.g., 
for random-dense and mixed workloads. This confirms the intuition 
that more factorization opportunities lead to higher gains. 
DFS-AVF-STV resulted in views with 3.2 atoms on average, whereas 
GSTR-AVF-STV produced views with 6.5 atoms on average. 

We conclude that Dfs-Avf-Stv scales well up to 200 queries, 
depending on the workload structural complexity, and can achieve 
very significant reductions in the state cost. 
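
As a reading aid for the rcr values reported above, the following sketch assumes that the relative cost reduction is the fraction of the initial state cost saved by the best state found. This definition is not restated in this section, so the formula is an assumption, and the cost values are invented for illustration rather than taken from the experiments. 

```python
def relative_cost_reduction(initial_cost: float, best_cost: float) -> float:
    """Assumed definition: share of the initial state cost removed by the best state."""
    if initial_cost <= 0:
        raise ValueError("initial_cost must be positive")
    return (initial_cost - best_cost) / initial_cost


def mean_rcr(runs):
    """Average rcr over several workloads of one type, as done for each bar of Figure 6."""
    values = [relative_cost_reduction(initial, best) for initial, best in runs]
    return sum(values) / len(values)


# Hypothetical (initial cost, best cost) pairs for three workloads of one type.
example_runs = [(1000.0, 10.0), (2000.0, 40.0), (500.0, 5.0)]
print(round(mean_rcr(example_runs), 4))
```

Under this assumed definition, an rcr of 0.99 means that the best state costs 1% of the initial state. 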

6.5 View selection and implicit triples 

We study the impact of implicit triples on view selection 
performance. Starting from a non-saturated database $D$ and workload 
$Q$, three scenarios are possible: (i) saturation: saturated database 
$D^s$, search on $Q$ and the statistics of $D^s$; (ii) 
pre-reformulation: original database $D$, search on the 
pre-reformulated workload $Q^r$ and the statistics of $D$; (iii) 
post-reformulation: original database $D$, search on $Q$ with the 
statistics of the saturated database $D^s$ (recall from Section 4.3 
that we gather them without actually saturating the database). Of 
course, we consider the same RDF entailment rules for the three 
scenarios, i.e., those brought by an RDFS. Saturation and 
post-reformulation coincide for any search algorithm, since they lead 
to the same input statistics and workload. Hence, we only study the 
search for pre- and post-reformulation. 
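
To make the difference between saturating the data and reformulating the queries concrete, here is a deliberately simplified sketch. It handles only the subclass rule (the RDFS statements considered in the paper also include other kinds), it covers only single-atom queries of the form (?x, rdf:type, c), and the class names, data, and function names are invented for this example; it is not the reformulation algorithm of Section 4. 

```python
RDF_TYPE = "rdf:type"


def saturate(triples, subclass):
    """Add implicit rdf:type triples; `subclass` maps a class to its direct superclasses."""
    result = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(result):
            if p != RDF_TYPE:
                continue
            for sup in subclass.get(o, ()):
                if (s, RDF_TYPE, sup) not in result:
                    result.add((s, RDF_TYPE, sup))
                    changed = True
    return result


def descendants(cls, subclass):
    """The class itself plus all its direct and indirect subclasses."""
    found, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for child, sups in subclass.items():
            if current in sups and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def reformulate(cls, subclass):
    """Rewrite the atom (?x, rdf:type, cls) into a union of atoms, one per (sub)class."""
    return [(None, RDF_TYPE, c) for c in sorted(descendants(cls, subclass))]


def evaluate(atoms, triples):
    """Subjects matching at least one atom of the union."""
    return {s for (s, p, o) in triples for (_, ap, ao) in atoms if p == ap and o == ao}


schema = {"Book": ["Publication"], "Publication": ["Work"], "Painting": ["Work"]}
data = {("b1", RDF_TYPE, "Book"), ("p1", RDF_TYPE, "Painting")}

saturated = saturate(data, schema)
reformulated = reformulate("Work", schema)
answers_saturation = evaluate([(None, RDF_TYPE, "Work")], saturated)
answers_reformulation = evaluate(reformulated, data)
print(answers_saturation == answers_reformulation, len(saturated), len(reformulated))
```

Both routes return the same answers on this toy input, but saturation grows the data (from 2 to 5 triples) whereas reformulation grows the query (from 1 atom to a union of 4), which is the kind of workload growth that pre-reformulation feeds into the search. 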

This experiment uses the Barton dataset as well. The schema 
consists of 39 classes, 61 properties, and 106 RDFS statements 
of the kinds listed in Table 1. We generated two satisfiable 
workloads $Q_1$ and $Q_2$, whose properties and those of their 
reformulated versions $Q_1^r$ and $Q_2^r$ are characterized in 
Table 3; $|Q|$ denotes the number of queries in $Q$, #a(Q) the 
number of atoms and #c(Q) the number of constants. $Q_1$ is a 
subset of $Q_2$. 

Figure 7 shows the evolution of the best cost found by DFS-AVF-STV 
for both workloads (post-reformulation) and their reformulated 
variants (pre-reformulation). The search was cut after 3 hours. We 
see that the initial state for reformulated workloads has higher 
cost than for the original workloads. Further, the best state cost 
decreases rapidly with post-reformulation, because the workload is 
much smaller and the search space is traversed faster. In contrast, 
the large workload sizes slow down the cost decrease for 
pre-reformulation. The best cost of pre-reformulation is higher than 
that of post-reformulation, by a factor of 2.7 for $Q_1$, and 22 for 
$Q_2$. This confirms our expectation that the advantages of 
post-reformulation are most visible for larger workloads (with 
larger $Q^r$). Moreover, the best cost is reached faster in 
post-reformulation. 

Figure 8: Execution times for queries with RDFS. 

In general, the number of implicit triples increases with the size 
of the database $D$ and of the schema $S$. We show in [25] that 
the bound is $O(|D| \times |S|)$ for the considered RDFS entailment 
rules. Similarly, $|Q^r|$ may be the same as $|Q|$, or exponentially 
larger (Theorem 4.1). In a reformulation-based setting, view 
selection based on post-reformulation is clearly better than based 
on pre-reformulation, since the initial state is better and search is 
faster, especially for large workloads. Between saturation and 
post-reformulation, the best choice strongly depends on the context 
(distribution, rights to update the database, frequency and types of 
updates etc.) as explained in Section 4.2. The views recommended in 
a saturation and a post-reformulation context are the same. 

6.6 View-based query evaluation 

We now study the benefits that our recommended views actually 
bring to query evaluation (recall though that our view selection 
does not optimize for query evaluation only, but for a combination 
including storage and maintenance costs). For the workload $Q_1$ 
described in Section 6.5, we materialized the views recommended by 
pre- and post-reformulation, and ran the 5 queries of $Q_1$ 
using (i) the views, (ii) the (dictionary-encoded, heavily indexed) 
saturated triple table in PostgreSQL, (iii) a restricted version of 
(ii) only with the triples needed for answering $Q_1$, (iv) RDF-3X 
[17] (loading the saturated database in it), and (v) the 
materialization of the query workload (initial state). RDF-3X times 
are given as a reference; by using PostgreSQL (even with views) we 
did not expect to get better times than those of the 
state-of-the-art RDF platform. 

The views were materialized in 81 seconds for post-reformulation 
(the total view size was 433 MB or 15% of the database size), and 
103 seconds for pre-reformulation (601 MB or 21% of the database 
size). Figure 8 shows that using our views, queries are evaluated 
more than an order of magnitude faster than on the triple table, 
even when using the restricted triple table (iii). Both pre- and post- 
reformulation performed in the range of RDF-3X. This is a promis- 
ing result, since our approach can be used on top of RDF-3X and 
achieve an even bigger gain. Finally, as expected, materializing the 
queries gives the best results (simply scanning the views is suffi- 
cient). More experiments are provided in [25]. 

Pre-computed views are likely to speed up query evaluation in 
any platform, simply by avoiding computations at runtime. More- 
over, our framework (i) avoids the overhead of query rewriting at 
run-time, as query rewritings are also pre-computed and (ii) could 
easily translate our rewritings directly to any RDF platform's logi- 
cal plans, exploiting its physical optimization capabilities. 

6.7 Experiment conclusion 

Our experiments have shown that the GSTR and DFS strategies 
scale well on up to 200 queries and achieve impressive cost 
reduction factors, in many cases close to 99%. The strategies of 
[21] are also effective for small workloads, but they outgrow the 
memory on larger ones before producing a solution. The AVF and STV 
heuristics are efficient and effective, i.e., they reduce the search 
space while preserving view set quality. Post-reformulation largely 
outperforms pre-reformulation in terms of speed and effectiveness of 
the candidate view set selection. Finally, our recommended views 
do reduce query evaluation times by several orders of magnitude. 

A tighter integration of the view selection tool with the internals 
of the data management platform, and/or using a dedicated RDF 
system, is likely to increase performance gains even more. 

7. RELATED WORKS 

Our work is among the first to explore materialized view selec- 
tion in RDF databases. The closest works related to ours are [6] 
and [9]. RDFMatView [6] recommends RDF indices to materialize 
for a given workload, while in [9] a set of path expressions 
appearing in the given workload is selected to be materialized, both 
aiming at improving the performance of query evaluation. Unlike our 
approach, neither of these works aims at rewriting the queries 
completely using the materialized indices or paths; thus, they cannot be 
used in scenarios where the client needs to process her queries even 
without access to the database. Moreover, they do not consider the 
implicit triples that are inherent to RDF. 

Commonly used RDF management platforms (e.g., Sesame, 
3store or Jena) are based on a relatively simple mapping of triples 
within a relational database. Many works have addressed the ef- 
ficient processing of RDF queries and updates [1, 15, 16, 17, 20, 
22, 23], proposing various storage and indexing models. In vertical 
partitioning [1] one (s, o) relation is created for each property value 
(possibly leading to large unions for queries with variables in the 
p position). The authors of [16, 17] have built RDF-3X, a native 
RDF query engine. In many of the approaches, the (s, p, o) table is 
indexed in multiple ways (by each attribute, each pair of attributes 
etc.), a technique originally introduced in [23]. Recently, the prob- 
lem of view-based SPARQL query rewriting was studied in [13]. 
These techniques have been shown to result in good RDF query 
and update performance. We view our approach as complementary 
to these works, since we seek to identify materialized views to store 
on top (independently) of the base store and indexes. To adapt our 
approach to a specific RDF platform, one only needs (i) an execu- 
tion framework capable of evaluating our simple select-project-join 
rewritings and (ii) possibly, tailoring the cost function to the partic- 
ularities of the platform. Our approach improves performance by 
exploiting pre-computed results and thus avoiding computations at 
query evaluation time, gains likely to extend to any context. 
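
The remark above about vertical partitioning and variables in the p position can be illustrated with a small sketch. It is a toy in-memory model written for this purpose, not the storage scheme of [1]; the function names and the sample triples are invented for the example. 

```python
from collections import defaultdict


def vertically_partition(triples):
    """Build one (s, o) relation per property value."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def match(tables, s=None, p=None, o=None):
    """Evaluate one triple pattern; None stands for a variable."""
    # With a constant property a single relation is read; with a variable in
    # the p position the answer is a union over every per-property relation.
    props = [p] if p is not None else list(tables)
    return [(ts, tp, to)
            for tp in props
            for (ts, to) in tables.get(tp, [])
            if (s is None or ts == s) and (o is None or to == o)]


triples = [("X", "author", "Jane"), ("X", "date", "4/1/2011"), ("Y", "author", "Bob")]
tables = vertically_partition(triples)
by_property = match(tables, p="author")   # reads one relation
about_x = match(tables, s="X")            # must visit all relations
print(by_property, about_x)
```

With many distinct properties, the second kind of pattern touches every relation, which is where the large unions mentioned above come from. 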

The main results on query rewriting for answering queries using 
views are surveyed in [11]. In contrast with query rewriting algo- 
rithms, views are not part of the input of view selection, but are part 
of the output together with the rewritings. In particular, and follow- 
ing [21], our view selection algorithm generates rewritings while 
searching for candidate views. As for the rewritings themselves, 
view selection produces equivalent rewritings, as query rewriting 
does in the setting of query optimization, while query rewriting for 
data integration typically produces maximally-contained rewritings 
due to the incompleteness of the data sources. 

Materialized view selection has been intensely studied in 
relational databases [8] and data warehouses [12]. We used [21] as a 
starting point for our work, as it is one of the prevalent works in 
the area and the closest to our problem definition and query 
language. However, [21] imposes the restriction that no relation may 
appear twice in a workload query, under which view equivalence can 
be tested in PTIME. This simplification is incompatible with RDF 
queries, which repeatedly use the triple table. In our 
context, determining view equivalence (needed for VF and for the 
search strategies) is NP-complete [7]. This, along with the typically 
bigger size of RDF queries compared to the relational ones (since 
only one table with three attributes is used), increases the 
complexity of the problem even more. Hence, the strategies presented 
in [21] are not effective in our context. We innovate over [21] by 
proposing new search strategies and heuristics, which, as 
demonstrated in Section 6, do not suffer from memory limitations and 
lead to the selection of efficient views, even if we limit the time 
of the search. Furthermore, there are some differences between our 
transitions and those in [21], due to the differences between their 
SQL-like language and our Datalog formalism (for more details see 
[25]). 

Multi-query optimization [28] and partial view materialization 
[29] are also related works. Unlike our approach, neither of them 
aims to completely rewrite the queries using the views. In [28], 
common subexpressions among the queries are recognized and 
materialized. Views with disjunctions are supported, which we 
also plan to do as future work. In [29] views are only partially 
materialized and their content is adjusted as the queries change, 
which is another difference from our work (we consider static 
queries). 

Query reformulation (a.k.a. unfolding) is directly related to query 
answering under constraints interpreted in an open-world assump- 
tion (e.g., [19]), i.e., when constraints are used as deductive rules. 
In particular, our query reformulation algorithm builds on those 
in the literature considering the so-called Description Logic (DL) 
fragment of RDF [3, 5], i.e., description logic constraints. This 
fragment corresponds to RDF databases without blank nodes that 
are made of an RDFS, called a Tbox, and a dataset made of as- 
sertions for classes and properties in the RDFS, called an Abox, 
i.e., well-formed triples of the form (s, rdf:type, c) or (s, p, o), 
where c is a class and p a property of the RDFS. Lastly, the RDF 
entailment rules considered are only those dedicated to an RDFS 
(see Section 4.1). Reformulation algorithms for the DL fragment 
of RDF actually reformulate queries from a strictly less expressive 
language than our RDF queries. They only support atoms in which 
the class or the property is specified, i.e., they do not support atoms 
of the form t(s, rdf:type, X) or t(s, X, o) with X a variable. To 
overcome this, our reformulation algorithm extends the state of the 
art to our RDF queries, i.e., the BGP of SPARQL. 

An early version of this work was demonstrated in [10]. 

8. CONCLUSION AND FUTURE WORK 

We considered the setting of a Semantic Web database, including 
both explicit data encoded in RDF triples, and implicit data, derived 
from the RDF entailment rules [26]. Implicit data is important since 
correctly evaluating a query against an RDF database also requires 
taking it into account. In this context, we have addressed the prob- 
lem of efficiently recommending a set of views to materialize, min- 
imizing a combination of query evaluation, view storage and view 
maintenance costs. Starting from an existing relational approach, 
we have proposed new search algorithms and shown that they scale 
to large query workloads, for which previous search algorithms fail. 
Our view selection approach can be used as well with a saturated 
RDF database (where all implicit triples are added explicitly to the 
data), or with a non-saturated one (when queries need to be refor- 
mulated to reflect implicit triples). We have proposed a new algo- 
rithm for reformulating queries based on an RDF Schema, as well 
as a novel post-reformulation method for taking into account im- 
plicit triples in a query reformulation context. Post-reformulation 
can be much more efficient than naive pre-reformulation, due to the 
high complexity of view search in the number of queries. 

As future work, we consider parallelizing our view search 
algorithms by identifying groups of workload queries that do not 
have many commonalities with each other and running the search in 
parallel for each group. We also consider extending our query and 
view language, as well as adapting our approach to dynamic query 
workloads. 